SunSolve Internal

Infodoc ID   Synopsis   Date
12031   Capturing system hangs and crashes on Solaris 2.X   2 Sep 1999

Description

               Collecting System Crash Dump Images
                    On Sun Solaris 2.X Systems


+----------------------+
|  Panic() & Savecore  |
+----------------------+

When a Solaris 2.X system panics, the panic() routine writes
an image of system memory to the dump device.  This image is
delimited by short dump records, one at each end of the dump
image.

When the system reboots, /etc/init.d/sysetup is run.  This
script can be used to call the savecore utility.  By default,
the section of Bourne shell code which calls savecore is
commented out.  The system administrator must uncomment it.

When run, savecore examines the dump device.  If the two short
dump records are seen and it appears that a valid system crash
dump image exists, savecore will read the image and write it
into a disk file in a specified directory.  Savecore also puts
a copy of the kernel namelist into this directory.


+---------------------------------+
|  Dump Device Disk Requirements  |
+---------------------------------+

The panic() routine is a rather primitive routine.  It may not
know about volume managers or other advanced disk management
techniques and sub-systems.

Panic() can only write to one dump device.  This will be the
primary swap device; in other words, the first swap device
listed in /etc/vfstab.
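
To see which device that is, the first swap entry can be pulled out
of /etc/vfstab.  The following is only a sketch: first_swap_device
is a hypothetical helper name, and it assumes the usual vfstab
layout in which the fourth field is the filesystem type.

```shell
# Print the device named in the first "swap" entry of a vfstab-format
# file.  first_swap_device is a hypothetical helper, not a Solaris
# command.  Assumes field 1 is the device and field 4 is the FS type.
first_swap_device() {
    # $1: path to the vfstab file (defaults to /etc/vfstab)
    awk '$1 !~ /^#/ && $4 == "swap" { print $1; exit }' "${1:-/etc/vfstab}"
}

# Example (not run here):
#   first_swap_device /etc/vfstab
```

Entries such as the tmpfs mount of /tmp name "swap" in the first
field but have a filesystem type of tmpfs, so they are skipped.  On
a running system, "swap -l" lists the swap devices currently in use.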

Crash dumps vary in size based on the memory configuration of
the system and how much of that memory was in use.  On large
systems, crash dumps that fill the entire allowed 2 GB primary
swap partition have been seen.  In 64-bit Solaris 7, even larger
dumps will sometimes be compressed to fit into a 2 GB swap
area.

Individual workstations tend to produce much smaller crash dumps,
often less than 50 MB in size.

The primary swap device (disk partition) must be large enough
to hold the system crash dump image.  On releases before Solaris 7,
it must also not be even ONE BYTE larger than 2.0 GB, not even as a
result of rounding by the partition or format commands, unless you
have Solaris 2.6 with patch 107490, or Solaris 2.5.1 with patch
108083.  See SRDB 6467.
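
The 2.0 GB ceiling works out to 2147483648 bytes, or 4194304 blocks
of 512 bytes.  As a rough sketch, a partition size can be compared
against that figure as shown below.  The helper name swap_size_ok is
hypothetical, the size is assumed to be known in 512-byte blocks,
and the patch exceptions mentioned above are not taken into account.

```shell
# 2.0 GB = 2147483648 bytes = 4194304 blocks of 512 bytes each.
MAX_DUMP_BLOCKS=4194304

# swap_size_ok is a hypothetical helper.  It reports whether a
# partition of the given size stays within the pre-Solaris 7 limit.
swap_size_ok() {
    # $1: partition size in 512-byte blocks (an assumption; convert
    #     first if your tool reports another unit)
    if [ "$1" -le "$MAX_DUMP_BLOCKS" ]; then
        echo "ok"
    else
        echo "too large"
    fi
}

# Example (not run here):
#   swap_size_ok 4194304
```

A partition of exactly 4194304 blocks passes; a single block more
does not.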


+------------------------------+
|  Savecore Disk Requirements  |
+------------------------------+

Savecore is called from /etc/init.d/sysetup (which is hard-
linked to /etc/rc2.d/S20sysetup).  Savecore is called with one
argument: the name of the directory where the dump image is to
be stored.

The specified savecore directory must be on a filesystem with
enough free disk space to hold the system crash dump image.
Remember that the image can be quite large at times.

If you are concerned about savecore taking too much space in the
filesystem, you may create the file minfree in the directory in
which savecore is to save the files.  In this file, place a number.
This number specifies the minimum free space (in kilobytes) that
must be available in the filesystem for a dump to be created.
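
Creating the minfree file takes a single redirection.  The sketch
below wraps it in a hypothetical helper, set_minfree; the directory
and the 20000 KB figure in the example are illustrative values only.

```shell
# set_minfree is a hypothetical helper that writes the minfree file
# into a savecore directory.
set_minfree() {
    # $1: savecore directory
    # $2: minimum free space, in kilobytes
    echo "$2" > "$1/minfree"
}

# Example (not run here; 20000 is an arbitrary illustrative value):
#   set_minfree /var/crash/`uname -n` 20000
```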


+-----------------------+
|  /etc/init.d/sysetup  |
+-----------------------+

By default, for Solaris 2.x (not Solaris 7), the last few lines
of the sysetup script read as follows:

  ##
  ## Default is to not do a savecore
  ##
  #if [ ! -d /var/crash/`uname -n` ]
  #then mkdir -p /var/crash/`uname -n`
  #fi
  #                echo 'checking for crash dump...\c '
  #savecore /var/crash/`uname -n`
  #                echo ''  

For Solaris 7, see the dumpadm(1M) man page for savecore
information instead.

To enable savecore on Solaris 2.x, the system administrator needs
to remove the leading '#' from each of the sysetup lines shown
above.  The result should look like this:

  #
  # Default is to not do a savecore
  #
  if [ ! -d /var/crash/`uname -n` ]
  then mkdir -p /var/crash/`uname -n`
  fi
                  echo 'checking for crash dump...\c '
  savecore /var/crash/`uname -n`
                  echo ''  


If /var is part of the root filesystem, chances are very good
that this filesystem is just not roomy enough to be used for
crash dumps.  Therefore, it will often be necessary to customize
three of these lines.  For example:

  #
  # Default is to not do a savecore
  #
  if [ ! -d /bigdisk/crashes/`uname -n` ]               <--- 1
  then mkdir -p /bigdisk/crashes/`uname -n`             <--- 2
  fi
                  echo 'checking for crash dump...\c '
  savecore -v /bigdisk/crashes/`uname -n`               <--- 3
                  echo ''

`uname -n` specifies use of the system hostname as part of the
savecore directory name.  Alternatively, savecore can be called
without use of the hostname.  For example:

  savecore -v /home8/my_panics

Note also that there is a -v option to savecore which can be
used to get more "verbose" output from savecore.


+------------------------------+
|  Testing The Savecore Setup  |
+------------------------------+

Intentionally crashing a system is not recommended.  However,
there are occasions when this is required for various reasons.

If you are the system administrator or system owner, and you
must force your system to crash in order to test your savecore
setup, please do the following:

1)  Back up all of your data.  System crashes can result in
    non-recoverable and catastrophic loss of data.

2)  Gracefully halt your system using 'halt' or 'init 0'.

3)  At the boot prom "ok" prompt, enter:  sync
    Your system should start panicking at this point.  You should
    see "dumping" messages.

4)  Next, the system will attempt to reboot.  During this
    process you should see some savecore messages.

5)  Once the system is rebooted, look in your savecore directory
    and see if you have system crash dump files there.  They
    will be named "unix.#" and "vmcore.#", where # is the crash
    number.  There should also be a "bounds" file.  This contains
    the next crash number for savecore to use.
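
The check in step 5 can be scripted.  The following is a sketch
only: check_crash_dir is a hypothetical helper, and it assumes
nothing beyond the "unix.#", "vmcore.#" and "bounds" file names
described above.

```shell
# check_crash_dir is a hypothetical helper.  It reports each vmcore.N
# found in the savecore directory and whether the matching unix.N is
# present.  Returns 0 if at least one complete pair exists.
check_crash_dir() {
    # $1: savecore directory
    dir=$1
    rc=1
    for core in "$dir"/vmcore.*; do
        # If nothing matched, the pattern is left as-is; skip it.
        [ -f "$core" ] || continue
        # The crash number is whatever follows the last dot.
        n=`echo "$core" | sed 's/.*\.//'`
        if [ -f "$dir/unix.$n" ]; then
            echo "crash $n: complete"
            rc=0
        else
            echo "crash $n: unix.$n missing"
        fi
    done
    if [ -f "$dir/bounds" ]; then
        next=`cat "$dir/bounds"`
        echo "next crash number: $next"
    fi
    return $rc
}

# Example (not run here):
#   check_crash_dir /var/crash/`uname -n`
```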


+----------------------------------+
|  Converting A Hang Into A Panic  |
+----------------------------------+

Hung systems are the most difficult to debug.  Fortunately,
sometimes a hang can be converted into a panic and an image of
memory can be obtained which can later be analyzed.  This is
*NOT* always the case, however.

Before trying to panic a hung system, make sure the system is
really hung first!

1)  Are *ALL* of the users affected by the hang?

2)  Can you ping the system?

3)  Can you remotely log into the hung system?

4)  Can root log in on the console?

If you are sure the whole system is hung, try to force a panic.
To do this, press L1-A (Stop-A) on the console keyboard to get to
the boot prom prompt, then follow steps 3 through 5 of the
savecore test described earlier.

If L1-A doesn't result in a boot prom prompt, try disconnecting
and reconnecting the console keyboard.  Only use this as a last
resort and if you are really desperate to get a crash dump, as
this step can occasionally cause hardware problems.  (In general,
you should never disconnect hardware which is powered up.)

If you cannot force a panic, you will have to power cycle the
system and let it reboot normally.  Note that as soon as you
remove power from the system, the contents of memory are lost
forever!  Forcing a panic *AFTER* power cycling will result in
a system crash dump which will *not* contain evidence as to why
the system had hung up earlier.


+-------------------------------------------+
|  What To Do With System Crash Dump Files  |
+-------------------------------------------+

Once you have successfully collected a system crash dump image,
you have 2 possible courses of action:

1)  Call SunService for assistance (see Infodoc 14230)

2)  Analyze the crash dump files on your own (see Infodoc 12936 and
    13039)

For additional information about crash dump analysis, refer to
the book "Panic! UNIX System Crash Dump Analysis" by Chris Drake
and Kimberley Brown, ISBN 0-13-149386-8.  Panic! is available
through SunExpress, SunSoft Press, and Prentice Hall.

See also:
srdb     6660      savecore reports:  savecore: /dev/dump: No such device
srdb     6467      savecore is enabled, but a coredump is not produced
srdb     14172     How come a system corefile was created when the system did
                   not crash?
infodoc  6332      how to enable savecore in Solaris 2.x
faqs     1563      How to save a system crash dump
faqs     1611      How to save a system crash dump
faqs     2220      How to setup a tipline on a x86 2.5.1 system for kadb
srdb     10170     To save crashdump when machine panics at kadb prompt
srdb     17314     How to retrieve a crash dump from a SunScreen SPF-200
infodoc  11816     How to force crashes on Solaris X86 machines
srdb     16646     No suitable partition from swapvol to set as the dump device
infodoc  13981     Solaris 2.3 Patch Report Update
infodoc  15484     Limiting the size of a panic dump under Solaris 2.5.1
infodoc  15553     Forcing a core dump on an x86 system
infodoc  17152     watchdog FAQ
Product Area Kernel
Product crash
OS Solaris 2.x
Hardware any

Sun Proprietary/Confidential: Internal Use Only