Chapter 17. Network Performance Analysis

17.1. Network congestion and network interfaces

A network that was designed to ensure transparent access to filesystems and to provide "plug-and-play" services for new clients is a prime candidate for regular expansion. Joining several independent networks with routers, switches, hubs, bridges, or repeaters may add to the traffic level on one or more of the networks. However, a network cannot grow indefinitely without eventually experiencing congestion problems. Therefore, don't grow a network without planning its physical topology (cable routing and limitations) as well as its logical design. After several spurts of growth, performance on the network may suffer due to excessive loading.

The problems discussed in this section affect NIS as well as NFS service. Adding network partitioning hardware affects the transmission of broadcast packets, and poorly placed bridges, switches, or routers can create new bottlenecks in frequently used network "virtual circuits." Throughout this chapter, the emphasis will be on planning and capacity evaluation, rather than on low-level electrical details.

17.1.1. Local network interface

Ethernet cabling problems, such as incorrect or poorly made Category-5 cabling, affect all of the machines on the network. Conversely, a local interface problem is visible only to the machine suffering from it. An Ethernet interface device driver that cannot handle the packet traffic is an example of such a local interface problem.

The netstat tool gives a good indication of the reliability of the local physical network interface:

% netstat -in 
Name Mtu  Net/Dest     Address      Ipkts   Ierrs  Opkts   Oerrs Collis Queue  
lo0  8232 127.0.0.0    127.0.0.1    7188    0      7188     0     0      0     
hme0 1500 129.144.8.0  129.144.8.3  139478  11     102155   0     3055   0

The first three columns show the network interface, the maximum transmission unit (MTU) for that interface, and the network to which the interface is connected. The Address column shows the local IP address (the hostname would have been shown had we not specified -n). The five columns from Ipkts through Collis contain counts of the packets received and sent, the errors encountered while handling them, and collisions. The collision count indicates the number of times a collision occurred when this host was transmitting.

Input errors can be caused by:

Malformed or runt packets, damaged on the network by electrical problems.
Bad CRC checksums, which may indicate that another host has a network interface problem and is sending corrupted packets. Alternatively, the cable connecting this workstation to the network may be damaged and corrupting frames as they are received.
The device driver's inability to receive the packet due to insufficient buffer space.

A high output error rate indicates a fault in the local host's connection to the network or prolonged periods of collisions (a jammed network). Packet collisions themselves are not included in this count.

Ideally, both the input and output error rates should be as close to zero as possible, although some short bursts of errors may occur as cables are unplugged and reconnected, or during periods of intense network traffic. After a power failure, for example, the flood of packets from every diskless client that automatically reboots may generate input errors on the servers that attempt to boot all of them in parallel. During normal operation, an error rate of more than a fraction of 1% deserves investigation. This rate seems incredibly small, but consider the data rates on a Fast Ethernet: at 100 Mb/sec, the maximum bandwidth of a network is about 150,000 minimum-sized packets each second. An error rate of 0.01% means that fifteen of those 150,000 packets get damaged each second. Diagnosis and resolution of low-level electrical problems such as CRC errors is beyond the scope of this book, although such an effort should be undertaken if high error rates are persistent.
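The arithmetic above can be checked with a short script. This is only a sketch: the counters are copied from the hme0 line of the sample netstat output, the helper name error_rate is made up for illustration, and the frame overhead figures (an 8-byte preamble and a 12-byte interframe gap around a 64-byte minimum frame) are standard Ethernet values that the text itself does not state.

```python
def error_rate(errors: int, packets: int) -> float:
    """Return errors as a percentage of packets (0.0 if nothing was counted)."""
    if packets == 0:
        return 0.0
    return 100.0 * errors / packets

# Counters copied from the hme0 line of the sample "netstat -in" output.
ipkts, ierrs = 139478, 11
opkts, oerrs = 102155, 0

input_error_rate = error_rate(ierrs, ipkts)
output_error_rate = error_rate(oerrs, opkts)
print(f"input errors:  {input_error_rate:.4f}%")
print(f"output errors: {output_error_rate:.4f}%")

# Assumption (not from the text): a minimum-sized frame occupies
# 64 bytes of frame + 8 bytes of preamble + 12 bytes of interframe gap.
LINK_BITS_PER_SEC = 100_000_000
WIRE_BYTES_PER_MIN_FRAME = 64 + 8 + 12

max_frames_per_sec = LINK_BITS_PER_SEC / (WIRE_BYTES_PER_MIN_FRAME * 8)
damaged_per_sec = max_frames_per_sec * 0.01 / 100   # a 0.01% error rate

print(f"maximum frames per second: {max_frames_per_sec:,.0f}")
print(f"damaged per second at 0.01%: {damaged_per_sec:.1f}")
```

With these counters the input error rate on hme0 comes out well under 0.01%, so the sample host would not warrant investigation on this basis.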

17.1.2. Collisions and network saturation

Ethernet is similar to an old party-line telephone: everybody listens at once, everybody talks at once, and sometimes two talkers start at the same time. In a well-conditioned network, with only two hosts on it, it's possible to use close to the network's maximum bandwidth. However, NFS clients and servers live in a burst-filled environment, where many machines try to use the network at the same time. When you remove the well-behaved conditions, usable network bandwidth decreases rapidly.

On an Ethernet, a host first checks for a transmission in progress on the network before attempting one of its own. This process is known as carrier sense. When two or more hosts transmit packets at exactly the same time, none of them can sense a carrier, and a collision results. Each host recognizes that a collision has occurred, and backs off for a period of time, t, before attempting to transmit again. For each successive retransmission attempt that results in a collision, t is increased exponentially, with a small random variation. The variation in back-off periods ensures that machines generating collisions do not fall into lock step and seize the network.
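The back-off behavior can be sketched in a few lines. The sketch below assumes the truncated binary exponential backoff scheme of IEEE 802.3, which is more specific than the description above: the upper bound on the wait doubles with each successive collision (up to a limit), and the actual wait is drawn at random below that bound. The constants and the function name backoff_slots are illustrative and do not come from any driver source.

```python
import random

# Assumed constants, following IEEE 802.3 for 10 and 100 Mb/sec Ethernet.
SLOT_TIME_BITS = 512   # one slot time, expressed in bit times
BACKOFF_LIMIT = 10     # the exponent stops growing after 10 collisions
MAX_ATTEMPTS = 16      # the frame is abandoned after 16 collisions

def backoff_slots(collisions: int, rng: random.Random) -> int:
    """Pick how many slot times to wait after the nth successive collision."""
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if collisions >= MAX_ATTEMPTS:
        raise RuntimeError("excessive collisions: transmission abandoned")
    exponent = min(collisions, BACKOFF_LIMIT)
    # Uniform choice in 0 .. 2**exponent - 1, so the range doubles each time.
    return rng.randrange(2 ** exponent)

rng = random.Random()
for n in range(1, 6):
    slots = backoff_slots(n, rng)
    print(f"collision {n}: wait {slots} slots ({slots * SLOT_TIME_BITS} bit times)")
```

Because each host draws its own random wait, two hosts that collided once are unlikely to pick the same delay again, which is what keeps them from falling into lock step.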

As machines are added to the network, the probability of a collision increases. Network utilization is measured as a percentage of the ideal bandwidth consumed by the traffic on the cable at the point of measurement. Various levels of utilization are usually compared on a logarithmic scale. The relative decrease in usable bandwidth going from 5% utilization to 10% utilization is about the same as going from 10% all the way to 30% utilization.

Measuring network utilization requires a LAN analyzer or similar device. Instead of measuring the traffic load directly, you can use the average collision rate as seen by all hosts on the network as a good indication of whether the network is overloaded or not. The collision rate, as a percentage of output packets, is one of the best measures of network utilization. The Collis field in the output of netstat -in shows the number of collisions:

% netstat -in 
Name Mtu  Net/Dest     Address      Ipkts   Ierrs  Opkts   Oerrs Collis Queue  
lo0  8232 127.0.0.0    127.0.0.1    7188    0      7188     0     0      0     
hme0 1500 129.144.8.0  129.144.8.3  139478  11     102155   0     3055   0

The collision rate for a host is the number of collisions seen by that host divided by the number of packets it writes, as shown in Figure 17-1. For hme0 in the output above, that is 3055 collisions for 102155 output packets, or about 3%.

Figure 17-1. Collision rate calculation

Collisions are counted only when the local host is transmitting; the collision rate experienced by the host is dependent on its network usage. Because network transmissions are random events, it's possible to see small numbers of collisions even on the most lightly loaded networks. A collision rate upwards of 5% is the first sign of network loading, and it's an indication that partitioning the network may be advisable.
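As a rough illustration, the collision rate can be computed directly from the netstat -in output. The sketch below parses the sample output shown earlier, held in a string rather than read from a live system; the function name collision_rates is made up for this example, and the 5% threshold is the one suggested above.

```python
SAMPLE = """\
Name Mtu  Net/Dest     Address      Ipkts   Ierrs  Opkts   Oerrs Collis Queue
lo0  8232 127.0.0.0    127.0.0.1    7188    0      7188     0     0      0
hme0 1500 129.144.8.0  129.144.8.3  139478  11     102155   0     3055   0
"""

THRESHOLD_PERCENT = 5.0   # first sign of network loading, per the text

def collision_rates(netstat_text: str) -> dict:
    """Map each interface name to Collis as a percentage of Opkts."""
    lines = netstat_text.strip().splitlines()
    header = lines[0].split()
    rates = {}
    for line in lines[1:]:
        # Pair each column heading with the value beneath it.
        fields = dict(zip(header, line.split()))
        opkts = int(fields["Opkts"])
        collis = int(fields["Collis"])
        rates[fields["Name"]] = 100.0 * collis / opkts if opkts else 0.0
    return rates

for name, rate in collision_rates(SAMPLE).items():
    verdict = "consider partitioning" if rate > THRESHOLD_PERCENT else "ok"
    print(f"{name}: {rate:.2f}% collisions ({verdict})")
```

Because the counters are cumulative, a single reading gives only a long-term average. Comparing two readings taken some time apart gives the rate over that interval.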

