When debugging a network problem, start by forming a hypothesis about the likely cause, then use it to rule out other factors. For example, if your attempts to bind to an NIS server are failing, you can test the network using ping, check the health of the ypserv processes using rpcinfo, and finally test the binding itself with ypset. Working your way up through the protocol layers ensures that you don't miss a low-level problem that is posing as a higher-level failure. In keeping with that advice, we'll start by looking at a network-layer problem.
    NFS server muskrat not responding still trying

or:

    ypbind: NIS server not responding for domain "techpubs"; still trying
The very sporadic nature of the problem -- and the fact that it resolved itself over time -- pointed toward a problem with ARP request and reply mismatches. This hypothesis neatly explained the extraordinarily slow loading of the application: a client machine trying to read the application executable would do so by issuing NFS Version 2 requests over UDP. To send the UDP packets, the client would ARP the server, randomly get the wrong reply, and then be unable to use that entry for several minutes. When the ARP table entry had aged and was deleted, the client would again ARP the server; if the correct ARP response was received then the client could continue reading pages of the executable. Every wrong reply received by the client would add a few minutes to the loading time.
There were several possible sources of the ARP confusion, so to isolate the problem, we forced a client to ARP the server and watched what happened to the ARP table:
    # arp -d muskrat
    muskrat (139.50.2.1) deleted
    # ping -s muskrat
    PING muskrat: 56 data bytes
    No further output from ping

By deleting the ARP table entry and then directing the client to send packets to muskrat, we forced an ARP of muskrat from the client. ping timed out without receiving any ICMP echo replies, so we examined the ARP table and found a surprise:

    # arp -a | fgrep muskrat
    le0 muskrat 255.255.255.255 08:00:49:05:02:a9

Since muskrat was a Sun workstation, we expected its Ethernet address to begin with 08:00:20 (the prefix assigned to Sun Microsystems), not the 08:00:49 prefix used by Kinetics gateway boxes. The next step was to figure out how the wrong Ethernet address was ending up in the ARP table: was muskrat lying in its ARP replies, or had we found a network imposter?
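The prefix comparison described above is easy to automate. The following is a minimal Python sketch, not part of the original investigation: the `KNOWN_PREFIXES` table and the function names are hypothetical, and the table holds only the two prefixes mentioned in the text. It also assumes that addresses may appear without zero padding (for example `8:0:20:...`), so it normalizes them first.

```python
# Hypothetical vendor table: only the two prefixes named in the text.
KNOWN_PREFIXES = {
    "08:00:20": "Sun Microsystems",
    "08:00:49": "Kinetics",
}


def normalize_mac(mac):
    """Return the address as six zero-padded, lowercase hex octets."""
    octets = mac.strip().lower().split(":")
    if len(octets) != 6:
        raise ValueError(f"not an Ethernet address: {mac!r}")
    return ":".join(f"{int(octet, 16):02x}" for octet in octets)


def vendor_prefix(mac):
    """Return the first three octets, which identify the manufacturer."""
    return normalize_mac(mac)[:8]


def check_vendor(mac, expected_vendor):
    """Return (matches, actual_vendor) for an address and the vendor we expect."""
    actual = KNOWN_PREFIXES.get(vendor_prefix(mac), "unknown")
    return actual == expected_vendor, actual
```

Calling `check_vendor("08:00:49:05:02:a9", "Sun Microsystems")` on the address found in the ARP table reports a mismatch and names Kinetics as the prefix owner, which is the same conclusion reached by eye.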
Using a network analyzer, we repeated the ARP experiment and watched ARP replies returned. We saw two distinct replies: the correct one from muskrat, followed by an invalid reply from the Kinetics FastPath gateway. The root of this problem was that the Kinetics box had been configured using the IP broadcast address 0.0.0.0, allowing it to answer all ARP requests. Reconfiguring the Kinetics box with a non-broadcast IP address solved the problem.
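If the analyzer can export the replies it saw as address pairs, spotting this condition can be reduced to a few lines. The sketch below is illustrative only: it assumes the capture has already been reduced to `(ip_address, ethernet_address)` tuples in arrival order, and the function name is hypothetical. In the usage note, the address `08:00:20:aa:bb:cc` is a placeholder for muskrat's real address, which the text does not give.

```python
from collections import defaultdict


def find_conflicting_replies(replies):
    """Map each IP address to the distinct Ethernet addresses that claimed it.

    replies: iterable of (ip_address, ethernet_address) pairs in arrival order.
    Only IP addresses claimed by more than one Ethernet address are returned.
    """
    seen = defaultdict(list)
    for ip, mac in replies:
        mac = mac.lower()
        if mac not in seen[ip]:
            seen[ip].append(mac)
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}
```

For a capture containing `("139.50.2.1", "08:00:20:aa:bb:cc")` followed by `("139.50.2.1", "08:00:49:05:02:a9")`, the function returns an entry for 139.50.2.1 listing both addresses. Any non-empty result means two machines are answering for the same IP address.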
The last update to the ARP table is the one that "sticks," so whichever reply arrived last determined the entry the client used. The Kinetics FastPath was located on the other side of the bridge, so its replies were delayed in transit and were virtually guaranteed to be the last to arrive, overwriting the correct ARP table entry with the wrong Ethernet address. When muskrat was heavily loaded, it was slow to reply to the ARP request, and its ARP response would instead be the last to arrive. Reconfiguring the Kinetics FastPath to use a proper IP address and network mask cured the problem.
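The "last reply wins" behavior can be modeled in a few lines. This is a toy sketch under stated assumptions, not a description of any particular ARP implementation: the arrival times are invented for illustration, `08:00:20:aa:bb:cc` is a placeholder for muskrat's real address, and the function name is hypothetical.

```python
def resolve(replies):
    """Return the Ethernet address left in the cache after all replies arrive.

    replies: list of (arrival_time, ethernet_address) pairs.
    Each reply overwrites the entry written by the one before it.
    """
    cache_entry = None
    for _, mac in sorted(replies, key=lambda reply: reply[0]):
        cache_entry = mac  # the most recent reply replaces the entry
    return cache_entry


MUSKRAT = "08:00:20:aa:bb:cc"   # placeholder, not the real address
FASTPATH = "08:00:49:05:02:a9"  # address seen in the ARP table

# Invented timings (milliseconds): the bridge delays the FastPath reply.
usual_case = [(1.0, MUSKRAT), (3.0, FASTPATH)]
# Invented timings: a heavily loaded muskrat answers after the FastPath.
loaded_server = [(8.0, MUSKRAT), (3.0, FASTPATH)]
```

With these inputs, `resolve(usual_case)` leaves the FastPath address in the cache and `resolve(loaded_server)` leaves muskrat's. The outcome depends only on the order of arrival, not on which reply is correct.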
ARP servers that have out-of-date information create similar problems. This situation arises if an IP address is changed without a corresponding update of the server's published ARP table initialization, or if the IP address in question is re-assigned to a machine that implements the ARP protocol. If an ARP server was employed because muskrat could not answer ARP requests, then we should have seen exactly one ARP reply, coming from the ARP server. However, an ARP server with a published ARP table entry for a machine capable of answering its own ARP requests produces exactly the same duplicate response symptoms described above. With both machines on the same local network, the failures tend to be more intermittent, since there is no obvious time-ordering of the replies.
There's a moral to this story: you should rarely need to know the Ethernet address of a workstation, but it does help to have them recorded in a file or NIS map. This problem was solved with a bit of luck, because the machine generating incorrect replies had a different manufacturer, and therefore a different Ethernet address prefix. If the incorrectly configured machine had been from the same vendor, we would have had to compare the Ethernet addresses in the ARP table with what we believed to be the correct addresses for the machine in question.
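If the addresses are recorded in an ethers-style file, the comparison described above can be scripted. The following Python sketch makes several assumptions that should be checked against the local system: the recorded file has one `ethernet-address hostname` pair per line, the `arp -a` output has the device in the first column, the hostname in the second, and the Ethernet address in the last (as in the output shown earlier), and all function names are hypothetical.

```python
import re

# Six hex octets separated by colons, each one or two digits long.
MAC_RE = re.compile(r"^[0-9a-fA-F]{1,2}(:[0-9a-fA-F]{1,2}){5}$")


def canon(mac):
    """Zero-pad and lowercase each octet so addresses compare reliably."""
    return ":".join(f"{int(octet, 16):02x}" for octet in mac.split(":"))


def parse_ethers(text):
    """Parse 'ethernet-address hostname' lines; '#' starts a comment."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and MAC_RE.match(fields[0]):
            table[fields[1]] = canon(fields[0])
    return table


def parse_arp_table(text):
    """Parse 'device hostname mask [flags] ethernet-address' lines (assumed layout)."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and MAC_RE.match(fields[-1]):
            table[fields[1]] = canon(fields[-1])
    return table


def mismatches(arp_text, ethers_text):
    """Return {hostname: (recorded, observed)} for hosts whose addresses differ."""
    recorded = parse_ethers(ethers_text)
    observed = parse_arp_table(arp_text)
    return {
        host: (recorded[host], mac)
        for host, mac in observed.items()
        if host in recorded and recorded[host] != mac
    }
```

Hosts that appear in only one of the two inputs are ignored rather than reported, on the assumption that an incomplete record is more common than a rogue machine. A stricter version could flag them as well.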
Copyright © 2002 O'Reilly & Associates. All rights reserved.