Infodoc ID: 2241
Synopsis: Diskless Boot Troubleshooting Guide
Date: 13 Apr 1995
Diskless 4.X Boot Troubleshooting Guide
SOLUTION SUMMARY:
4.x Diskless Boot Procedure
The following convention is used throughout this guide:
'ethers', 'hosts', and 'bootparams' refer to
/etc/ethers, /etc/hosts, and /etc/bootparams if YP is not
running on the server. If YP is running, they refer to the
corresponding YP maps.
To further explain the boot process, I will use the
following as an example :
Hostname IP_address Hex_IP Ethernet_address
------------------------------------------------------------
Server: batman 129.145.30.15 81911E0F 8:0:20:6:d4:f5
Client: penguin 129.145.30.27 81911E1B 8:0:20:7:6:b9
============================================================
There are 3 distinct procedures which take place during a
4.x diskless client bootup.
A. The RARP Process (Reverse Address Resolution Protocol)
During this process the client broadcasts its
48-bit Ethernet address to its local network.
Any system that is running rarpd and has the client's
Ethernet address in 'ethers' takes the hostname
listed for that address and looks it up
in 'hosts'. If the host is found, that system returns
the IP address associated with the host to the client.
The client then prints "Using IP address xxxx = xxxx"
(except on sun4c clients, which are silent at this point).
B. The TFTP Process (Trivial File Transfer Protocol)
The client now uses the Hexadecimal representation of
its IP address to issue a tftp request across the net.
The server must have the same hexadecimal number, in uppercase, in its
/tftpboot directory as a symbolic link to boot.`arch -k`,
where `arch -k` is one of {sun3, sun4, sun4c, sun3x}, e.g.:
lrwxrwxrwx 1 root wheel 10 Sep 26 13:31 81911E1B -> boot.sun4c
The server must also have the following link in the
/tftpboot directory:
lrwxrwxrwx 1 root wheel 1 Jul 21 1989 tftpboot -> .
The server must also have tftp uncommented in both of these files:
/etc/inetd.conf and
/etc/services (or the YP 'services' map if running YP).
Once the client finds the boot file on the
server, it downloads it into local memory. The
boot PROM then executes the downloaded boot file, which sends
an RPC bootparam request to the network.
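As a rough illustration of the naming rule above, the following Python sketch derives the /tftpboot link name from a dotted-quad IP address. The function name tftpboot_link_name is hypothetical and is not part of any SunOS tool; the addresses are the example values used in this guide.

```python
import ipaddress

def tftpboot_link_name(ip):
    """Return the 8-digit uppercase hex form of an IPv4 address.

    Hypothetical helper: each of the four octets becomes two hex digits,
    which is the link name the client requests over tftp.
    """
    return "%08X" % int(ipaddress.IPv4Address(ip))

print(tftpboot_link_name("129.145.30.27"))  # client penguin: 81911E1B
print(tftpboot_link_name("129.145.30.15"))  # server batman:  81911E0F
```

The result is always uppercase, which matters because a lowercase link name in /tftpboot is not found by the client.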
C. The Bootparam Process
Any system on the same net as the client that is running
rpc.bootparamd and has 'bootparams' info for the
client will tell the client which system and
path to NFS mount its root and swap file systems from.
The client will then attempt to mount its root and swap
file systems from the server named in bootparams.
The server must have the following /etc/exports entries for
the client's root and swap, and must have run 'exportfs -a':
/export/root/penguin -root=penguin,access=penguin
/export/swap/penguin -root=penguin,access=penguin
(This can be verified by running exportfs with no options.)
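The export requirement above can be checked mechanically. The sketch below is only an illustration: expected_exports and missing_exports are hypothetical names, and it assumes the exports text has one "path options" pair per line in exactly the form shown in this guide.

```python
def expected_exports(client):
    """Return the (path, options) pairs this guide requires for a client."""
    opts = "-root=%s,access=%s" % (client, client)
    return [("/export/root/" + client, opts),
            ("/export/swap/" + client, opts)]

def missing_exports(exports_text, client):
    """Return the required entries that are absent from the exports text."""
    present = set()
    for line in exports_text.splitlines():
        fields = line.split()
        # Skip comments and lines without both a path and an option field.
        if len(fields) >= 2 and not fields[0].startswith("#"):
            present.add((fields[0], fields[1]))
    return [entry for entry in expected_exports(client) if entry not in present]

sample = "/export/root/penguin -root=penguin,access=penguin\n"
print(missing_exports(sample, "penguin"))
```

With the sample text, only the swap entry is reported as missing. The comparison is exact, so an entry with differently ordered options would also be reported.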
1. What the client console should look like when booting :
>b le()
Boot: le(0,0,0)
Using IP Address 129.145.30.27 = 81911E1B
Booting from tftp server at 129.145.30.15 = 81911E0F
action here: Spinning propeller .... -/|\-/|\|/-\| ,
(** or ** incrementing numbers in a box on Sun4c clients )
Downloaded xxxxx bytes from tftp server.
Using IP Address 129.145.30.27 = 81911E1B
hostname: penguin
domainname: gotham
server name 'batman'
rootpathname '/export/root/penguin'
root on batman:/export/root/penguin fstype nfs
Boot: vmunix
Size: #####+#####+#####
<normal boot follows>
......
......
......
2. The Client<->Server Dialog <Problems/What to look for/What to do>
A. What can go wrong during the RARP stage ?
- Any blank lines or trailing spaces on lines in 'ethers' file
will cause RARP to fail.
- Any leading 0s between colons in 'ethers' will cause RARP to fail.
Correct : 8:0:20:7:6:b9 penguin
Incorrect: 8:00:20:07:06:b9 penguin
- Uppercase hostnames for clients will cause RARP to fail.
Lookup in 'ethers' succeeds, but gethostbyname() converts uppercase
to lowercase, then looks up the lowercase name in 'hosts' and can't
find it.
If running YP, the makedbm for hosts.{byname,byaddr} can be modified
in /var/yp/Makefile to use the '-l' option to convert uppercase to
lowercase.
- If nit, pf, nbuf, or clone are commented out of the server's kernel,
rarpd will fail to run. The following lines must be included in the
server's kernel:
pseudo-device snit # streams NIT
pseudo-device pf # packet filter
pseudo-device nbuf # NIT buffering module
pseudo-device clone
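The 'ethers' formatting rules listed above (no blank lines, no trailing spaces, no leading zeros, no uppercase hostnames) lend themselves to a quick automated check. The following is a minimal sketch; lint_ethers is a hypothetical name, and the input is assumed to be the text of /etc/ethers or a dump of the YP map in the same two-column layout.

```python
def lint_ethers(text):
    """Return (line_number, problem) pairs for lines that can break RARP."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append((n, "blank line"))
            continue
        if line != line.rstrip():
            problems.append((n, "trailing whitespace"))
        fields = line.split()
        # An octet such as "07" or "00" has a leading zero; "7" and "0" do not.
        if any(len(o) > 1 and o.startswith("0") for o in fields[0].split(":")):
            problems.append((n, "leading zero in Ethernet address"))
        if len(fields) > 1 and fields[1] != fields[1].lower():
            problems.append((n, "uppercase hostname"))
    return problems

print(lint_ethers("8:0:20:7:6:b9 penguin\n"))     # well-formed entry
print(lint_ethers("8:00:20:07:06:b9 penguin\n"))  # leading zeros
```

The first call returns an empty list and the second reports the leading zeros on line 1.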
B. What can go wrong during the TFTP stage?
- 'tftp timeout' :
This is a common error when tftp is commented out
of either /etc/inetd.conf or /etc/services.
This error can also occur if the hexadecimal
representation of the client's IP address is
missing or incorrect in /tftpboot, or if the
tftpboot -> . link is missing in /tftpboot.
Lowercase hexadecimal characters in the boot
file link name will also cause this failure,
e.g.: incorrect -> 81911e1b instead of
correct -> 81911E1B.
- 'file not found' :
tftp is not able to find the boot.`arch -k`
file. This is common with 4/60 and 3/80 clients,
when setup_exec has not been run for sun4c or
sun3x respectively.
- panic ......... :
If setup_client was run with the
wrong architecture (e.g., sun4 instead of sun4c),
the client will probably panic when the boot PROM
tries to execute the boot file.
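The /tftpboot conditions described above can be verified on the server before the client is booted. This is a sketch under stated assumptions: check_tftpboot is a hypothetical helper, and it only inspects the directory contents; it does not test inetd.conf, services, or the tftp daemon itself.

```python
import os

def check_tftpboot(tftpboot_dir, hex_ip):
    """Return a list of problems with the client's boot link setup."""
    problems = []
    names = os.listdir(tftpboot_dir)
    name = hex_ip.upper()
    if name in names:
        path = os.path.join(tftpboot_dir, name)
        if not os.path.islink(path):
            problems.append(name + " is not a symbolic link")
        elif not os.path.exists(path):
            # The link exists but its target (boot.`arch -k`) does not.
            problems.append(name + " points to a missing boot file")
    elif name.lower() in names:
        problems.append("link name " + name.lower() + " must be uppercase")
    else:
        problems.append("no link named " + name)
    self_link = os.path.join(tftpboot_dir, "tftpboot")
    if not (os.path.islink(self_link) and os.readlink(self_link) == "."):
        problems.append("missing tftpboot -> . link")
    return problems
```

For the example client, calling check_tftpboot("/tftpboot", "81911E1B") on a correctly configured server would return an empty list. A link to a missing boot file corresponds to the 'file not found' case, and the other problems correspond to 'tftp timeout'.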
C. What can go wrong during the Bootparam stage?
- bad dialog with bootparam server :
This error is common if there is a third-party system on
the same network as the client, and that third-party machine
is also running RPC. The third-party system may respond to
the client's bootparam request with an RPC_FAILED response
before the real server can respond with an RPC_SUCCESS reply.
The diskless_boot_hang patch for bug #1018791 will usually
fix this problem. In cases where it doesn't, it
may be necessary for the customer to upgrade their boot PROM
to revision 3.0 or greater. The patch also addresses slow
booting problems.
- If the client's NFS server as listed in bootparams is down or
on another network, the client will show on the console:
hostname: penguin
domainname: gotham
server name 'batman'
/
Requesting Ethernet address for 129.145.30.15 = 81911E0F
- NFS error 13:
This is an NFS write error message (error 13 is "permission
denied"), which usually indicates that the client does not
have root access for the indicated file system. This usually happens
when the server is exporting /export as well as
/export/root/client and /export/swap/client. If
/export is already exported, the server will not be
able to export its subdirectories as well.
- null hostname returned from bootparam server - or -
- null domain returned from bootparam server :
These error messages indicate that there is probably
a Silicon Graphics system on the same network as the
client. If this is the case, have the customer kill
the rpc.bootparamd on the Silicon Graphics systems if
they are not boot servers. If they are boot servers,
the customer can anonymous ftp to sgi.sgi.com, cd to
sgi/src, change to binary mode and get rpc.bootparamd.Z,
or contact the Silicon Graphics Support hotline at
1-800-345-0222 for the SGI bootparamd patch.
- clntkudp_callit retries exhausted:
This can come from an extremely busy network, or
from name lookup problems created by using a libc.so
shared library with name resolver routines built-in.
- bp_getclntent failed , bp_getclntkey failed :
These messages will pop up if the name of the client
in 'bootparams' is not the first hostname listed after
the IP address in 'hosts'. This usually happens when
client is in a nameserver domain. Also be aware of
servers which are using the libc.so library with
resolver routines built-in. This library bypasses both
NIS(YP) and /etc/hosts, and looks only at the nameserver
hosts database (i.e., hostname=penguin, but entry in 'hosts'
is 129.145.30.27 penguin.sun.com penguin #).
- "whoami RPC call failed with status #
\ panic: vfs_mountroot: cannot mount root":
There is a bug in a DEC product that causes these symptoms.
If there is a VMS VAX on the Ethernet, 4.0 is being run, and
the booting problems described here occur, find out from
the system manager whether something called the "ULTRIX BRIDGE"
is being run. ULTRIX BRIDGE is DEC's own version of the
Wollongong package that allows VMS machines to use REAL datacomm
protocols. At any rate, DEC has a patch for this software.
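Two of the bootparam-stage checks above are simple enough to script: finding exported directories nested under another exported directory (the NFS error 13 case), and finding which hostname comes first for an IP address in 'hosts' (the bp_getclntent case). The sketch below is illustrative only; nested_exports and first_hostname are hypothetical names, and the inputs are assumed to be plain text in the formats shown in this guide.

```python
def nested_exports(paths):
    """Return (parent, child) pairs where an exported path lies under another."""
    pairs = []
    for parent in paths:
        prefix = parent.rstrip("/") + "/"
        for child in paths:
            if child != parent and child.startswith(prefix):
                pairs.append((parent, child))
    return pairs

def first_hostname(hosts_text, ip):
    """Return the first hostname listed after the given IP address, or None."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 2 and fields[0] == ip:
            return fields[1]
    return None

print(nested_exports(["/export", "/export/root/penguin"]))
print(first_hostname("129.145.30.27 penguin.sun.com penguin #\n", "129.145.30.27"))
```

In the second call the first hostname is penguin.sun.com, so a 'bootparams' entry keyed on penguin would not match, which is the situation described for bp_getclntent failed.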
3. SparcStation1 (4/60) anomaly:
The default client vmunix will not boot properly. Remake the
client kernel.
To remake SS1 client kernel (need to be root):
# cd /usr/share/sys/sun4c/conf
# cp GENERIC CLIENT
# vi CLIENT
change => config vmunix swap generic
to => config vmunix root on type nfs swap on type nfs
# config CLIENT
# cd ../CLIENT ; make ; cp vmunix /export/root/penguin/vmunix
Reboot the Sparcstation1 client.
3.1 3/80 anomaly :
The DL80 config file in 4.0.3 has the following:
% more /usr/share/sys/sun3x/conf/DL80
config vmunix root on nfs
which should be:
% more /usr/share/sys/sun3x/conf/DL80
config vmunix root on type nfs swap on type nfs
4. Booting off of Server's 2nd Ethernet (ie1, le1, ..) NOT SUPPORTED!!
Suppose server has two interfaces :
ie0 = batman
ie1 = batman-gw
Server must run 'rarpd ie1 batman-gw'.
Server's 'bootparams' must look like :
% more /etc/bootparams
penguin root=batman-gw:/export/root/penguin
swap=batman-gw:/export/swap/penguin
Client's fstab ( /export/root/penguin/etc/fstab ) needs to have:
% more /etc/fstab
batman-gw:/export/root/penguin / nfs rw 0 0
batman-gw:/export/exec/sun4c /usr nfs ro 0 0
batman-gw:/export/exec/kvm/sun4c /usr/kvm nfs ro 0 0
batman-gw:/export/share /usr/share nfs ro 0 0
batman-gw:/home/penguin /home/penguin nfs rw 0 0
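In the second-interface setup above, every NFS source in the client's fstab names the same server hostname (batman-gw in this example). A small sketch for spotting entries that still point at the primary hostname follows; fstab_nfs_servers is a hypothetical helper, and the fstab text is assumed to use the six-column layout shown above.

```python
def fstab_nfs_servers(fstab_text):
    """Return the set of server hostnames used by nfs entries in an fstab."""
    servers = set()
    for line in fstab_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        # Column 1 is server:path, column 3 is the file system type.
        if len(fields) >= 3 and fields[2] == "nfs" and ":" in fields[0]:
            servers.add(fields[0].split(":", 1)[0])
    return servers

sample = (
    "batman-gw:/export/root/penguin / nfs rw 0 0\n"
    "batman:/export/exec/sun4c /usr nfs ro 0 0\n"
)
print(sorted(fstab_nfs_servers(sample)))
```

A result containing more than one hostname, as with the sample text, suggests that some mounts would go through a different interface than the one the client booted from.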
5. Troubleshooting/Debugging the Client Boot Process
A. During RARP stage
On the server :
% ps ax | grep rarpd
{Two rarpd processes should appear for each interface.
Make a note of the lowest PID.}
% trace -p PID_of_lowest_rarpd
{Some output should appear when the client broadcasts its
ethernet address, including something like:
open ("/etc/hosts", 0, 0666) = 5}
AND/OR
% kill -9 both_rarpd_pids
restart rarpds with the debug [-d] option i.e.:
% rarpd -d if# hostname (4.1 usage : 'rarpd -a -d')
On a third system run etherfind :
% etherfind -rarp -o -broadcast
{This should be seen from a normal rarp request.}
Using interface le0
icmp type
lnth proto source destination src port dst port
60 rarp old-broadcast old-broadcast
B. During the TFTP stage
On a third system run etherfind:
% etherfind -dstport tftp
{This is what should be seen from a normal tftp request}
{If it doesn't appear, suspect tftp problems on the server}
Using interface le0
icmp type
lnth proto source destination src port dst port
65 udp penguin batman 1604 tftp
C. During the Bootparam stage
On the server :
(Find and kill the rpc.bootparamd process, and restart)
% rpc.bootparamd -d {this turns on debug mode}
{watch for messages as the client boots}
{This is what should be seen from a normal bootparam request}
Whoami returning name = penguin, router address = 129.145.30.21
On a third system, run etherfind:
% etherfind -r -host penguin
{This is what should be seen from a normal bootparam request}
getfile_1: file is "batman" 129.145.30.15 "/export/root/penguin"
UDP from penguin.1023 to network.sunrpc 108 bytes
RPC Call portmapper PMAPPROC_CALLIT V2
UDP from penguin.1023 to mtnview.sunrpc 108 bytes
RPC Call portmapper PMAPPROC_CALLIT V2
60 arp penguin batman
UDP from penguin.1022 to batman.641 100 bytes
RPC Call prog 100026 proc 2 V1
60 arp penguin batman
Sun Proprietary/Confidential: Internal Use Only