This chapter outlines several strategies that show how these tools can be used together. When troubleshooting, your approach should be to look first at the specific task and then select the most appropriate tool(s) based on the task. I do not describe the details of using the tools or show output in this chapter. You should already be familiar with these from the previous chapters. Rather, this chapter focuses on the selection of tools and the overall strategy you should take in using them. If you feel confident in your troubleshooting skills, you may want to skip this chapter.
For truly difficult problems, you will need to be formal and systematic. What follows is a fairly general series of steps you can work through, along with a running example. Keep in mind that this set of steps is only a starting point.
[41] Compromised hosts are a special problem requiring special responses. Documentation can be absolutely essential, particularly if you are contemplating legal action or have liability concerns. Documentation used in legal actions has special requirements. For more information you might look at Simson Garfinkel and Gene Spafford's Practical UNIX & Internet Security or visit http://www.cert.org/nav/recovering.html.

Depending on your circumstances, management may require a written report. Even if this isn't the usual practice, if an outage becomes prolonged or if there are other consequences, it might become necessary. This is particularly true if there are some legal consequences of the problem. An accurate log can be essential in such cases.
If you have a complex problem, you are likely to forget at some point what you have actually done. This often means starting over. It can be particularly frustrating if you appear to have found a solution, but you can't remember exactly what you did. A seemingly insignificant step may prove to be a key element in a solution.
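One low-effort way to keep such a record is to append a timestamped line for every step you take. The following is only a minimal sketch. The function name `log_step`, the tab-separated layout, and the log filename are assumptions made for illustration, not anything prescribed in this book.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_step(logfile, action, result):
    """Append one timestamped entry: what was done and what was observed."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    entry = f"{stamp}\t{action}\t{result}\n"
    # Open in append mode so earlier entries are never overwritten.
    with Path(logfile).open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry
```

A call such as `log_step("outage.log", "telnet from bsd1 to lnx1", "connection timed out")` records even the seemingly insignificant steps, so you can retrace them later.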
As you identify symptoms, try to expand and clarify the problem. If the problem was reported by someone else, then you will want to try to recreate the problem so that you can observe the symptoms directly. Keep in mind, if you can't recognize normal behavior, you won't be able to recognize anomalous behavior. This has been a recurring theme in this book and a reason you should learn how to use these tools before you need them.
As an example, the first indication of a problem might be a user complaining that she cannot telnet from host bsd1 to host lnx1. To expand and clarify the problem, you might try different applications. Can you connect using ftp? You might look to see if bsd1 and lnx1 are on the same network or different networks. You might see if lnx1 can reach bsd1. You might include other local and remote hosts to see the extent of the problem.
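Trying several applications against the same host can be scripted as a quick TCP connection check. This is a rough sketch under stated assumptions: the helper names are hypothetical, and a successful TCP handshake shows only that the port is reachable, not that the service behind it is working.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_services(host, services, timeout=3.0):
    """Map each service name to whether its TCP port accepted a connection."""
    return {name: can_connect(host, port, timeout)
            for name, port in services.items()}


# Well-known TCP ports for the two applications mentioned in the text.
SERVICES = {"telnet": 23, "ftp": 21}
```

Running `probe_services("lnx1", SERVICES)` from bsd1, and then from other local and remote hosts, gives a first picture of how far the problem extends.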
Your problem definition may go through several refinements. Continuing with the previous problem, you might, over time, generate the following series of problem definitions:
It is natural to try to define the problem as quickly as possible, but you shouldn't be too tied to your definition. Try to keep an open mind and be willing to redefine your problem as your information changes.
In this example, we have worked outward from one system to include a number of systems. Usually troubleshooting tries to narrow the scope of the problem, but as seen from this example, in networking just the opposite may happen. You must discover the full scope of the problem before you can narrow your focus. In this running example, realizing that remote hosts could connect was a key discovery.
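Once you have results from several hosts, sorting them by whether each host shares the target's network can make a pattern like this one stand out. The sketch below assumes you already know the target's network prefix and have recorded each client's address and outcome. The function name and the addresses in the usage note are hypothetical.

```python
import ipaddress


def group_by_scope(target_net, results):
    """Split connection results into local and remote clients.

    target_net is the target's network in CIDR form; results maps a client
    IP address (string) to True or False for "could connect".
    """
    net = ipaddress.ip_network(target_net)
    scope = {"local": {}, "remote": {}}
    for addr, ok in results.items():
        where = "local" if ipaddress.ip_address(addr) in net else "remote"
        scope[where][addr] = ok
    return scope
```

For instance, `group_by_scope("172.16.2.0/24", {"172.16.2.10": False, "172.16.3.7": True})` places the failing host under `local` and the working one under `remote`.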
In general, you want tests that will reduce the size of the search space (i.e., identify the subsystem involved), that are easy to apply, and that do not create further problems.
In our running example, a necessary first step in making a connection is doing address resolution. This suggests that there might be some problem with the ARP mechanism. Notice that this is not a full hypothesis, but rather a point of further investigation. Having expanded the scope of the problem, we are attempting to focus in on subsystems to reduce the problem. Also notice that I haven't used any fancy tools up to this point. Keep it simple as long as you can.
Returning to our example, there are several ways we could investigate whether the ARP mechanism is functioning correctly. One way would be to use tcpdump or ethereal to capture traffic on the network to see if the ARP requests and responses are present. A simpler test, however, is to use the arp command to see if the appropriate entries are in the ARP cache on the hosts that are trying to connect to lnx1. In this instance, the entries were missing from all the hosts attempting to connect to lnx1. The exception was the router on the network, which had a much longer cache timeout than the local hosts did. This also explained why remote hosts could connect but local hosts could not: the remote hosts always went through the router, which still had lnx1's Ethernet address cached and so never needed to use the ARP mechanism. Note that this was not a definitive test but was done first because it was much easier.
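If you need to check the ARP cache on many hosts, the output of the arp command can be parsed rather than read by eye. The following sketch assumes output in the common `? (address) at hardware-address ...` layout printed by `arp -an` on many BSD and Linux systems. Formats differ between platforms, so treat the regular expressions as a starting point rather than a general parser.

```python
import re

# Matches "(192.0.2.1) at <something>" anywhere in a line of arp output.
ARP_LINE = re.compile(r"\((?P<ip>[0-9.]+)\)\s+at\s+(?P<mac>\S+)")
# Accepts both zero-padded (00:0a:...) and unpadded (0:a:...) octets.
MAC = re.compile(r"^[0-9A-Fa-f]{1,2}(:[0-9A-Fa-f]{1,2}){5}$")


def parse_arp_cache(text):
    """Map each IP address to its hardware address, or None if unresolved."""
    cache = {}
    for line in text.splitlines():
        m = ARP_LINE.search(line)
        if m:
            mac = m.group("mac")
            # Markers such as "<incomplete>" are not addresses: store None.
            cache[m.group("ip")] = mac if MAC.match(mac) else None
    return cache


def has_entry(cache, ip):
    """True only if the cache holds a resolved hardware address for ip."""
    return cache.get(ip) is not None
```

The text to parse could come from `subprocess.run(["arp", "-an"], capture_output=True, text=True).stdout` on the host being examined. A missing or unresolved entry for lnx1 on a local host is the symptom described above.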
With our extended example, two additional tests were possible. One was to manually add the address of lnx1 to bsd1's ARP table using the arp command. When this was done, connectivity was restored. When the entry was deleted, connectivity was lost. A more revealing but largely unnecessary test was to use packet-capture software to watch the exchange of packets between bsd1 and lnx1. This revealed that bsd1's ARP requests were being ignored by lnx1.
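The add-and-delete test can be wrapped so that each command is shown before anything is changed. This sketch relies only on the `-s` (set a static entry) and `-d` (delete an entry) options of the arp command. The helper names are hypothetical, and the commands normally require root privileges.

```python
import subprocess


def arp_add_cmd(host, mac):
    """Build the command that adds a static ARP entry for host."""
    return ["arp", "-s", host, mac]


def arp_delete_cmd(host):
    """Build the command that removes the ARP entry for host."""
    return ["arp", "-d", host]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if arp reports a failure.
        subprocess.run(cmd, check=True)
```

On bsd1, you might call `run(arp_add_cmd("lnx1", mac), dry_run=False)` with lnx1's real Ethernet address, retest connectivity, then call `run(arp_delete_cmd("lnx1"), dry_run=False)` and test again. Leaving `dry_run` at its default only prints the command, which is a convenient way to record it in your log.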
With our running problem, this was not necessary. Connectivity was fully restored when the system was rebooted. What caused the problem? That was never fully resolved, but since the problem never recurred, it really isn't an issue.
If restarting the system hadn't solved the problem, what would have been the next step? In this case, the likely problem was corrupted system software. If you are running an integrity checker like tripwire, you might try locating anything that has changed and do a selective reinstallation. Otherwise, you may be faced with reinstalling the operating system.
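To illustrate the idea behind an integrity checker, the sketch below records a SHA-256 digest for every file under a directory and later reports which paths differ from that baseline. It is not how tripwire itself is implemented or configured. The function names are hypothetical, and a real checker also protects its baseline from tampering and tracks permissions and ownership.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): file_digest(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def changed_files(baseline, current):
    """Return paths added, removed, or modified since the baseline."""
    return sorted(p for p in baseline.keys() | current.keys()
                  if baseline.get(p) != current.get(p))
```

Taking a `snapshot()` while the system is known to be good, and comparing it with a fresh one after a problem appears, yields the list of candidates for selective reinstallation.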
Copyright © 2002 O'Reilly & Associates. All rights reserved.