Linux NET-2/3-HOWTO: Diagnostic tools - How do I find out what is wrong?

14. Diagnostic tools - How do I find out what is wrong?

In this section I'll briefly describe some of the commonly used diagnostic tools available for your Linux network and how you might use them to identify the cause of network problems, or to teach yourself a bit more about how TCP/IP networking works. I'll gloss over some of the detail of how the tools work, because this document is not the appropriate forum for it, but I hope to present enough information that you'll understand how to use each tool and be better able to follow the relevant man page or other documentation.

14.1 ping - are you there?

The ping tool is located in the NetKit-B distribution as detailed above in the `Network Applications' section. ping, as the name implies, allows you to transmit a datagram to another host, which will reflect it back to you if it is alive and working and the network in between is also working. In its simplest form you would say:


# ping gw
PING gw.vk2ktj.ampr.org (44.136.8.97): 56 data bytes
64 bytes from 44.136.8.97: icmp_seq=0 ttl=254 time=35.9 ms
64 bytes from 44.136.8.97: icmp_seq=1 ttl=254 time=22.1 ms
64 bytes from 44.136.8.97: icmp_seq=2 ttl=254 time=26.0 ms
^C

--- gw.vk2ktj.ampr.org ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 22.1/28.0/35.9 ms
#

What ping has done is resolve the hostname to an address and then, using the icmp protocol, periodically transmit an icmp echo request datagram to the remote host. For each echo request that the remote host receives it formulates an icmp echo reply datagram, which it transmits back to you. Each line beginning with `64 bytes from ...' represents an echo reply received in response to an echo request. Each line tells you the address of the host that sent the reply, the sequence number of the request being answered, the time to live field and the total round trip time. The round trip time is the time between transmitting the echo request datagram and receiving the corresponding echo reply. It can be used as a measure of how fast or slow the network connection between the two machines is.
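To make the echo request described above more concrete, here is a minimal Python sketch of how such a datagram could be assembled. It only builds the bytes and does not transmit anything. The function names, the identifier value and the all-zero payload are assumptions made for illustration; they are not taken from the ping source.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum as used in the icmp header."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                   # fold any carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """Build an icmp echo request (type 8, code 0) around the payload."""
    # The checksum is computed with the checksum field itself set to zero.
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload

# 56 data bytes plus the 8 byte icmp header gives the 64 bytes ping reports.
packet = build_echo_request(ident=0x1234, seq=0, payload=bytes(56))
print(len(packet))   # prints 64
```

A receiver can validate such a datagram by running the same checksum over the whole message; a correct message yields zero.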

The summary at the end tells you how many datagrams were transmitted, how many valid responses were received and what percentage of the datagrams were lost, followed by the minimum, average and maximum round trip times. The percentage lost is a measure of how good or error free the network connection is. A high percentage lost indicates problems such as a high error rate on a link somewhere between the hosts, exhausted capacity on a router or link, or a high collision rate on an ethernet lan. You can use ping to identify where the problem might be by running ping sessions to each of the routed points that make up the network path. When you find that you can ping one point without any datagram loss, but pinging anywhere past it causes packet loss, you can deduce that the problem lies between those two points.
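The arithmetic behind the summary, and the hop by hop reasoning used to narrow down a lossy link, can be sketched in Python as follows. The round trip times are those from the sample session above; the per-hop loss figures are hypothetical, and the function names are mine.

```python
def summarise(transmitted, rtts_ms):
    """Return (loss_percent, min, avg, max) for one ping session."""
    received = len(rtts_ms)
    loss = 100.0 * (transmitted - received) / transmitted
    if not rtts_ms:
        return loss, None, None, None
    return loss, min(rtts_ms), sum(rtts_ms) / received, max(rtts_ms)

def first_lossy_hop(hops):
    """hops is a list of (name, loss_percent), nearest host first.

    Returns (last_clean_hop, first_lossy_hop), or None if nothing was lost.
    The fault is presumed to lie between the two names returned.
    """
    last_clean = None
    for name, loss in hops:
        if loss > 0:
            return last_clean, name
        last_clean = name
    return None

loss, lo, avg, hi = summarise(3, [35.9, 22.1, 26.0])
print(f"{loss:.0f}% packet loss, min/avg/max = {lo:.1f}/{avg:.1f}/{hi:.1f} ms")

# Hypothetical loss figures from separate ping sessions to each routed point.
sessions = [("gw", 0.0), ("gw.uts", 0.0), ("minnie.vk1xwt", 40.0)]
print(first_lossy_hop(sessions))
```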

14.2 traceroute - How do I get there?

The traceroute tool is found in the NetKit-A distribution detailed earlier. traceroute is primarily used for testing and displaying the path that your network connection would take to a destination host. traceroute also relies on the icmp protocol, but it uses a clever trick to get each point along the path to send it back a reply as it creeps its way along. Its trick is to manipulate the time to live field of the datagrams it transmits. The time to live field is a mechanism that ensures that rogue datagrams do not circulate forever if they get caught in a routing loop. Each router that a datagram passes through decrements the time to live field by one. If the time to live reaches zero, that router or host discards the datagram and sends an icmp time to live expired message back to the host that transmitted it. traceroute exploits this mechanism by sending a series of udp datagrams with the time to live set to one at first and incremented by one at each step. By recording the source addresses of the icmp time to live expired replies it receives as its datagrams die, it can determine the path taken to get to the destination. An example of its use would look something like:


# traceroute minnie.vk1xwt.ampr.org
traceroute to minnie.vk1xwt (44.136.7.129), 30 hops max, 40 byte packets
 1  gw (44.136.8.97)  51.618 ms  30.431 ms  34.396 ms
 2  gw.uts (44.136.8.68) 2017.322 ms  2060.121 ms 1997.793 ms
 3  minnie.vk1xwt (44.136.7.129) 2205.335 ms  2319.728 ms  2279.643 ms
#

The first column tells us how many hops away the responding host is (that is, the ttl value used), and the second column is the hostname and address of the host that responded, or just its address if the name could not be resolved. The third, fourth and fifth columns are the round trip times for three consecutive datagrams to that point. In this example the first hop in the network route is via gw.vk2ktj, the next hop is via gw.uts.ampr.org, and minnie.vk1xwt.ampr.org is one hop further away. You can deduce information about the network route by looking at the difference in times between each step in the route. The round trip times to gw are fairly fast because it is an ethernet connected host. gw.uts is substantially slower to reach than gw because it is across a low speed radio link, so you see the ethernet time and the radio link time added together. minnie.vk1xwt is only slightly slower than gw.uts because they are connected via a high speed network.
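The time to live trick that traceroute relies on can be illustrated with a small simulation. This is only a sketch: no datagrams are sent, the route is a plain Python list taken from the sample above, and the function names and the `reached destination' label are my own.

```python
def probe(path, ttl):
    """Simulate one datagram sent along path with the given time to live.

    Returns (responder, kind), where kind says why that host replied.
    """
    for router in path[:-1]:        # every entry except the last is a router
        ttl -= 1                    # each router decrements the time to live
        if ttl == 0:
            return router, "time exceeded"
    return path[-1], "reached destination"

def trace(path, max_hops=30):
    """Probe with ttl 1, 2, 3, ... until the destination itself answers."""
    route = []
    for ttl in range(1, max_hops + 1):
        responder, kind = probe(path, ttl)
        route.append((ttl, responder))
        if kind == "reached destination":
            break
    return route

path = ["gw", "gw.uts", "minnie.vk1xwt"]
for ttl, responder in trace(path):
    print(ttl, responder)
# prints:
# 1 gw
# 2 gw.uts
# 3 minnie.vk1xwt
```

Each increase of the starting time to live lets the datagram survive one router further, which is why the hop number and the ttl value are the same thing.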

If you perform a traceroute and you see the string !N appear after the time figure, this indicates that your traceroute program received a network unreachable response. This message tells you that the host or router that sent it did not know how to route to the destination address. This normally indicates that there is a network link down somewhere. The last address listed is as far as you got before reaching the faulty link.

Similarly if you see the string !H this indicates that a host unreachable message has been received. This might suggest that you got as far as the ethernet that the remote host is connected to, but the host itself is not responding or is faulty.
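If you want to process traceroute output in a script, the columns and the !N and !H markers described above can be picked apart with a couple of regular expressions. The sketch below assumes the line layout shown in the sample, with a resolved hostname followed by the address in parentheses. The first test line is copied from the sample; the second is a hypothetical line constructed to show the !N marker.

```python
import re

HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+(.*)$")
PROBE_RE = re.compile(r"([\d.]+) ms(?: (![NH]))?")

FLAGS = {"!N": "network unreachable", "!H": "host unreachable"}

def parse_hop(line):
    """Split one traceroute output line into its columns, or return None."""
    m = HOP_RE.match(line)
    if m is None:
        return None
    hop, name, addr, rest = m.groups()
    probes = PROBE_RE.findall(rest)          # list of (time, flag) pairs
    return {
        "hop": int(hop),
        "name": name,
        "addr": addr,
        "times_ms": [float(t) for t, _ in probes],
        "flags": sorted({FLAGS[f] for _, f in probes if f}),
    }

good = parse_hop(" 2  gw.uts (44.136.8.68) 2017.322 ms  2060.121 ms 1997.793 ms")
# Hypothetical line, not from a real trace:
bad = parse_hop(" 2  gw.uts (44.136.8.68)  2017.322 ms !N  2060.121 ms !N  1997.793 ms !N")
print(good["times_ms"], good["flags"])
print(bad["flags"])
```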

14.3 tcpdump - capturing and displaying network activity.

Adam Caldwell has ported the tcpdump utility to Linux. tcpdump allows you to take traces of network activity by intercepting the datagrams on their way in and out of your machine. This is useful for diagnosing network problems that are difficult to identify.

You can find the source and binaries at: 103mor2.cs.ohiou.edu

tcpdump decodes each of the datagrams that it intercepts and displays them in a slightly cryptic textual format. You would use tcpdump if you were trying to diagnose a problem like protocol errors or strange disconnections, as it allows you to see what has actually happened on the network. To use tcpdump properly you need some understanding of the protocols and how they work, but it is also useful for simpler duties, such as ensuring that datagrams are actually leaving your machine on the correct port when you are trying to diagnose routing problems, and seeing whether you are receiving datagrams from remote destinations.

A sample of tcpdump output looks like this:


# tcpdump -i eth0
tcpdump: listening on eth0
13:51:36.168219 arp who-has gw.vk2ktj.ampr.org tell albert.vk2ktj.ampr.org
13:51:36.193830 arp reply gw.vk2ktj.ampr.org is-at 2:60:8c:9c:ec:d4
13:51:37.373561 albert.vk2ktj.ampr.org > gw.vk2ktj.ampr.org: icmp: echo request
13:51:37.388036 gw.vk2ktj.ampr.org > albert.vk2ktj.ampr.org: icmp: echo reply
13:51:38.383578 albert.vk2ktj.ampr.org > gw.vk2ktj.ampr.org: icmp: echo request
13:51:38.400592 gw.vk2ktj.ampr.org > albert.vk2ktj.ampr.org: icmp: echo reply
13:51:49.303196 albert.vk2ktj.ampr.org.1104 > gw.vk2ktj.ampr.org.telnet: S 700506986:700506986(0) win 512 <mss 1436>
13:51:49.363933 albert.vk2ktj.ampr.org.1104 > gw.vk2ktj.ampr.org.telnet: . ack 1103372289 win 14261
13:51:49.367328 gw.vk2ktj.ampr.org.telnet > albert.vk2ktj.ampr.org.1104: S 1103372288:1103372288(0) ack 700506987 win 2048 <mss 432>
13:51:49.391800 albert.vk2ktj.ampr.org.1104 > gw.vk2ktj.ampr.org.telnet: . ack 134 win 14198
13:51:49.394524 gw.vk2ktj.ampr.org.telnet > albert.vk2ktj.ampr.org.1104: P 1:134(133) ack 1 win 2048
13:51:49.524930 albert.vk2ktj.ampr.org.1104 > gw.vk2ktj.ampr.org.telnet: P 1:28(27) ack 134 win 14335

 ..
#

When you start tcpdump without arguments it grabs the first (lowest numbered) network device that is not the loopback device. You can specify which device to monitor with a command line argument as shown above. tcpdump then decodes each datagram transmitted or received and displays them, one line each, in a textual form. The first column is the time the datagram was transmitted or received. The remainder of the line depends on the type of datagram. The first two lines in the sample show what an arp request from albert.vk2ktj for gw.vk2ktj, and the reply to it, look like. The next four lines are two pings from albert.vk2ktj to gw.vk2ktj; note that tcpdump tells you the name of the icmp datagram transmitted or received. The greater-than (>) symbol tells you which way the datagram was transmitted: it points from the sender to the receiver. The remainder of the sample trace is the establishment of a telnet connection from albert.vk2ktj to gw.vk2ktj.
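The line layout just described (a timestamp, then either `sender > receiver: detail' or some other description such as an arp message) is regular enough to split with a short Python function. This is a sketch based only on the sample lines above; other datagram types or tcpdump versions may print lines that it does not handle.

```python
import re

LINE_RE = re.compile(r"^(\d\d:\d\d:\d\d\.\d+) (.*)$")
FLOW_RE = re.compile(r"^(\S+) > (\S+): (.*)$")

def parse_line(line):
    """Split one tcpdump output line into time, sender, receiver and detail."""
    m = LINE_RE.match(line)
    if m is None:
        return None                      # not a datagram line, e.g. the banner
    stamp, rest = m.groups()
    flow = FLOW_RE.match(rest)
    if flow is None:                     # arp lines have no "sender > receiver"
        return {"time": stamp, "sender": None, "receiver": None, "detail": rest}
    sender, receiver, detail = flow.groups()
    return {"time": stamp, "sender": sender, "receiver": receiver, "detail": detail}

ping = parse_line("13:51:37.373561 albert.vk2ktj.ampr.org > gw.vk2ktj.ampr.org: icmp: echo request")
arp = parse_line("13:51:36.193830 arp reply gw.vk2ktj.ampr.org is-at 2:60:8c:9c:ec:d4")
print(ping["sender"], "->", ping["receiver"], "|", ping["detail"])
print(arp["detail"])
```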

The number or name appended to each hostname tells you which port number is being used. tcpdump looks in your /etc/services file to translate port numbers into service names.
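Separating the hostname from the trailing port or service name is a matter of splitting at the last dot. The Python sketch below uses a tiny built-in table in place of /etc/services so that it is self-contained; the table and the function name are assumptions for illustration. On a real system, Python's socket.getservbyname() performs the lookup against the system's services database.

```python
# A stand-in for a few /etc/services entries (well known tcp ports).
SERVICES = {"ftp": 21, "telnet": 23, "smtp": 25}

def split_endpoint(endpoint):
    """Split 'host.port' or 'host.service' into (host, port_number).

    Only meaningful for tcp and udp lines; icmp and arp lines carry no port,
    so the last label of a bare hostname would be mistaken for a service.
    The port is None when the service name is not in the table.
    """
    host, _, port = endpoint.rpartition(".")
    if port.isdigit():
        return host, int(port)
    return host, SERVICES.get(port)

print(split_endpoint("albert.vk2ktj.ampr.org.1104"))
print(split_endpoint("gw.vk2ktj.ampr.org.telnet"))
```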

tcpdump breaks out each of the fields, so you can see the values of the window (win) and mss parameters in some of the datagrams.
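As a small illustration, the win and mss values can be pulled out of the detail part of a tcp line like this. The pattern is based only on the layout seen in the sample trace, and the function name is hypothetical.

```python
import re

def tcp_parameters(detail):
    """Extract the window and mss values from a tcpdump tcp description."""
    win = re.search(r"\bwin (\d+)", detail)
    mss = re.search(r"<mss (\d+)>", detail)
    return {
        "win": int(win.group(1)) if win else None,
        "mss": int(mss.group(1)) if mss else None,   # absent after the handshake
    }

syn = tcp_parameters("S 700506986:700506986(0) win 512 <mss 1436>")
data = tcp_parameters("P 1:134(133) ack 1 win 2048")
print(syn)    # prints {'win': 512, 'mss': 1436}
print(data)   # prints {'win': 2048, 'mss': None}
```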

The man page documents all of the options available to you.

Note for PPP users: The version of tcpdump that is currently available does not support the PPP suite of protocols. Al Longyear has produced a pair of patches to correct this, but these have not been built into a tcpdump distribution yet. The patch files are located in the same directory on sunsite.unc.edu as the tcpdump package.

14.4 icmpinfo - logs icmp messages received.

ICMP, the Internet Control Message Protocol, conveys useful information about the health of your IP network. Often ICMP messages are received and acted on silently without you ever knowing of their presence. icmpinfo is a tool that allows you to view ICMP messages, much like tcpdump does. Laurent Demailly took the bsd ping source and modified it heavily.

Version 1.10 is available from:

hplyot.obspm.fr


/net/icmpinfo-1.10.tar.gz

Compilation is as simple as:


# cd /usr/src
# gzip -dc icmpinfo-1.10.tar.gz | tar xvf -
# cd icmpinfo-1.10
# make

You must be root to run icmpinfo. icmpinfo can either decode to the tty it was called from or send its output to the syslog utility.

To test out how it works, try running icmpinfo and starting a traceroute to a remote host. You will see the icmp messages that traceroute uses listed on the output.
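To help read that output, here is a short Python sketch that names the ICMP message types mentioned in this section from the first two bytes (type and code) of a message. The numeric values are those assigned by the ICMP specification; the function name and the table layout are mine and are not taken from icmpinfo. During a traceroute you would typically expect time exceeded messages from the routers along the way and a port unreachable from the destination itself.

```python
import struct

ICMP_TYPES = {
    0: "echo reply",
    3: "destination unreachable",
    8: "echo request",
    11: "time exceeded",
}
# Codes that qualify a destination unreachable (type 3) message.
UNREACH_CODES = {
    0: "network unreachable",    # shown by traceroute as !N
    1: "host unreachable",       # shown by traceroute as !H
    3: "port unreachable",
}

def describe_icmp(message: bytes) -> str:
    """Name an ICMP message from its type and code bytes."""
    icmp_type, code = struct.unpack("!BB", message[:2])
    name = ICMP_TYPES.get(icmp_type, f"type {icmp_type}")
    if icmp_type == 3:
        return f"{name} ({UNREACH_CODES.get(code, f'code {code}')})"
    return name

print(describe_icmp(bytes([11, 0, 0, 0])))   # prints time exceeded
print(describe_icmp(bytes([3, 1, 0, 0])))    # prints destination unreachable (host unreachable)
```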