Infodoc ID: 12031
Synopsis:   Capturing system hangs and crashes on Solaris 2.X
Date:       2 Sep 1999
Collecting System Crash Dump Images
On Sun Solaris 2.X Systems
+----------------------+
| Panic() & Savecore |
+----------------------+
When a Solaris 2.X system panics, the panic() routine writes
an image of system memory to the dump device. This image is
delimited by short dump records, one at each end of the dump
image.
When the system reboots, /etc/init.d/sysetup is run. This
script can be used to call the savecore utility. By default,
the section of Bourne shell code which calls savecore is
commented out. The system administrator must uncomment it.
When run, savecore examines the dump device. If the two short
dump records are seen and it appears that a valid system crash
dump image exists, savecore will read the image and write it
into a disk file in a specified directory. Savecore also puts
a copy of the kernel namelist into this directory.
+---------------------------------+
| Dump Device Disk Requirements |
+---------------------------------+
The panic() routine is a rather primitive routine. It may not
know about volume managers or other advanced disk management
techniques and sub-systems.
Panic() can only write to one dump device. This will be the
primary swap device; in other words, the first swap device
listed in /etc/vfstab.
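As a rough illustration, the first swap entry can be picked out
of a vfstab-format file with awk. This is only a sketch: the
function name primary_swap is hypothetical, and it assumes the
usual vfstab layout in which field 1 is the device to mount and
field 4 is the filesystem type.

```shell
# primary_swap FILE
# Print the device of the first swap entry in a vfstab-format file.
# Assumption: field 1 is the device, field 4 is the FS type.
# Comment lines (starting with #) are skipped.
primary_swap() {
    awk '$1 !~ /^#/ && $4 == "swap" { print $1; exit }' "$1"
}

# Typical use on a live system:
#   primary_swap /etc/vfstab
```

Matching on the filesystem type rather than on the word "swap"
anywhere in the line avoids picking up a tmpfs entry such as
/tmp, whose device field is also "swap".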
Crash dumps vary in size based on the memory configuration of
the system and how much of that memory was in use. On large
systems, crash dumps have been seen that fill the entire 2 GB
allowed for the primary swap partition. In 64-bit Solaris 7,
even larger dumps will sometimes be compressed to fit into a
2 GB swap area.
Individual workstations tend to have much smaller crash dumps,
often less than 50 MB in size.
The primary swap device (disk partition) must be large enough
to hold the system crash dump image. On releases before
Solaris 7, it must also not be even ONE BYTE larger than 2.0 GB,
not even as a result of rounding by the partition or format
commands, unless you have Solaris 2.6 with patch 107490 or
Solaris 2.5.1 with patch 108083. See SRDB 6467.
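The 2.0 GB ceiling can be checked with simple arithmetic. The
sketch below assumes that "2.0 GB" means 2^31 bytes, that is
4194304 blocks of 512 bytes; the names MAX_DUMP_BLOCKS and
swap_size_ok are hypothetical. The block count would come from
the partition table (for example, the sector count of the swap
slice), assuming 512-byte sectors.

```shell
# 2 GB expressed in 512-byte blocks (2147483648 / 512).
# Assumption: "2.0 GB" means 2^31 bytes.
MAX_DUMP_BLOCKS=4194304

# swap_size_ok BLOCKS
# Succeed if a partition of BLOCKS 512-byte blocks is no larger
# than the 2 GB limit; fail if it exceeds it by even one block.
swap_size_ok() {
    [ "$1" -le "$MAX_DUMP_BLOCKS" ]
}

# Example:
#   if swap_size_ok 4194304; then echo "within limit"; fi
```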
+------------------------------+
| Savecore Disk Requirements |
+------------------------------+
Savecore is called from /etc/init.d/sysetup (which is hard-
linked to /etc/rc2.d/S20sysetup). Savecore is called with one
argument: the name of the directory where the dump image is to
be stored.
The specified savecore directory must be on a filesystem which
has enough free disk space to hold the system crash dump image.
Remember that the image can be quite large at times.
If you are concerned about savecore taking too much space in the
filesystem, you may create the file minfree in the directory in
which savecore is to save the files. In this file, place a number.
This number specifies the minimum free space (in kilobytes) that
must be available in the filesystem for a dump to be created.
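Creating the minfree file takes a single command. The helper
below is a minimal sketch; the name set_minfree and the value
20000 are illustrative only and do not come from this document.

```shell
# set_minfree DIR KBYTES
# Write KBYTES (minimum free space, in kilobytes) into DIR/minfree.
# DIR is the savecore directory and must already exist.
set_minfree() {
    echo "$2" > "$1/minfree"
}

# Example: ask savecore to leave at least 20000 KB free
#   set_minfree /var/crash/`uname -n` 20000
```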
+-----------------------+
| /etc/init.d/sysetup |
+-----------------------+
By default, in Solaris 2.x releases (not Solaris 7), the last
few lines of the sysetup script read as follows:
##
## Default is to not do a savecore
##
#if [ ! -d /var/crash/`uname -n` ]
#then mkdir -p /var/crash/`uname -n`
#fi
# echo 'checking for crash dump...\c '
#savecore /var/crash/`uname -n`
# echo ''
To enable savecore, the system administrator needs to uncomment
all of these lines. (For Solaris 7, see the dumpadm man page
for savecore information instead.) The result should look like
this:
#
#Default is to not do a savecore
#
if [ ! -d /var/crash/`uname -n` ]
then mkdir -p /var/crash/`uname -n`
fi
echo 'checking for crash dump...\c '
savecore /var/crash/`uname -n`
echo ''
If /var is part of the root filesystem, chances are very good
that this filesystem is just not roomy enough to be used for
crash dumps. Therefore, it will often be necessary to customize
three of these lines. For example:
#
# Default is to not do a savecore
#
if [ ! -d /bigdisk/crashes/`uname -n` ]       # <--- 1
then mkdir -p /bigdisk/crashes/`uname -n`     # <--- 2
fi
echo 'checking for crash dump...\c '
savecore -v /bigdisk/crashes/`uname -n`       # <--- 3
echo ''
`uname -n` specifies use of the system hostname as part of the
savecore directory name. Alternatively, savecore can be called
without use of the hostname. For example:
savecore -v /home8/my_panics
Note also that there is a -v option to savecore which can be
used to get more "verbose" output from savecore.
+------------------------------+
| Testing The Savecore Setup |
+------------------------------+
Intentionally crashing a system is not recommended. However,
there are occasions when this is required for various reasons.
If you are the system administrator or system owner, and you
must force your system to crash in order to test your savecore
setup, please do the following:
1) Back up all of your data. System crashes can result in
non-recoverable and catastrophic loss of data.
2) Gracefully halt your system using 'halt' or 'init 0'.
3) At the OK> boot prom prompt enter: sync
Your system should start panic'ing at this time. You should
see "dumping" messages.
4) Next, the system will attempt to reboot. During this
process you should see some savecore messages.
5) Once the system is rebooted, look in your savecore directory
and see if you have system crash dump files there. They
will be named "unix.#" and "vmcore.#", where # is the crash
number. There should also be a "bounds" file. This contains
the next crash number for savecore to use.
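The check in step 5 can be scripted. The following is a sketch
only; the function name check_crash_dir is hypothetical. It
assumes, as described above, that the bounds file holds the next
crash number, so the most recent dump is numbered one less.

```shell
# check_crash_dir DIR
# Report whether the most recent unix.N / vmcore.N pair exists in
# DIR, where N is one less than the number stored in DIR/bounds.
check_crash_dir() {
    dir=$1
    if [ ! -f "$dir/bounds" ]; then
        echo "no bounds file in $dir"
        return 1
    fi
    next=`cat "$dir/bounds"`
    # awk is used for the subtraction because expr exits non-zero
    # when its result is 0.
    last=`echo "$next" | awk '{ print $1 - 1 }'`
    if [ -f "$dir/unix.$last" ] && [ -f "$dir/vmcore.$last" ]; then
        echo "found crash dump $last"
        return 0
    fi
    echo "crash dump $last is incomplete or missing"
    return 1
}

# Example:
#   check_crash_dir /var/crash/`uname -n`
```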
+----------------------------------+
| Converting A Hang Into A Panic |
+----------------------------------+
Hung systems are the most difficult to debug. Fortunately,
sometimes a hang can be converted into a panic and an image of
memory can be obtained which can later be analyzed. This is
*NOT* always the case, however.
Before trying to panic a hung system, make sure the system is
really hung first!
1) Are *ALL* of the users affected by the hang?
2) Can you ping the system?
3) Can you remotely log into the hung system?
4) Can root log in on the console?
If you are sure the whole system is hung, try to force a panic.
Press L1-A (Stop-A) on the console keyboard to reach the boot
prom prompt, then follow savecore test steps 3 through 5
described earlier.
If L1-A doesn't result in a boot prom prompt, try disconnecting
and reconnecting the console keyboard. Only use this as a last
resort and if you are really desperate to get a crash dump, as
this step can occasionally cause hardware problems. (In general,
you should never disconnect hardware which is powered up.)
If you cannot force a panic, you will have to power cycle the
system and let it reboot normally. Note that as soon as you
remove power from the system, the contents of memory are lost
forever! Forcing a panic *AFTER* power cycling will result in
a system crash dump which will *not* contain evidence as to why
the system had hung up earlier.
+-------------------------------------------+
| What To Do With System Crash Dump Files |
+-------------------------------------------+
Once you have successfully collected a system crash dump image,
you have 2 possible courses of action:
1) Call SunService for assistance (see Infodoc 14230)
2) Analyze the crash dump files on your own (see Infodoc 12936 and
13039)
For additional information about crash dump analysis, refer to
the book "Panic! UNIX System Crash Dump Analysis" by Chris Drake
and Kimberley Brown, ISBN 0-13-149386-8. Panic! is available
through SunExpress, SunSoft Press, and Prentice Hall.
See also:
srdb 6660 savecore reports: savecore: /dev/dump: No such device
srdb 6467 savecore is enabled, but a coredump is not produced
srdb 14172 How come a system corefile was created when the system did
not crash?
infodoc 6332 how to enable savecore in Solaris 2.x
faqs 1563 How to save a system crash dump
faqs 1611 How to save a system crash dump
faqs 2220 How to setup a tipline on a x86 2.5.1 system for kadb
srdb 10170 To save crashdump when machine panics at kadb prompt
srdb 17314 How to retrieve a crash dump from a SunScreen SPF-200
infodoc 11816 How to force crashes on Solaris X86 machines
srdb 16646 No suitable partition from swapvol to set as the dump device
infodoc 13981 Solaris 2.3 Patch Report Update
infodoc 15484 Limiting the size of a panic dump under Solaris 2.5.1
infodoc 15553 Forcing a core dump on an x86 system
infodoc 17152 watchdog FAQ
Sun Proprietary/Confidential: Internal Use Only