SunSolve Internal

Infodoc ID: 13039
Synopsis:   Troubleshooting System Hangs
Date:       8 Aug 1999

Description

Part of the difficulty in working with system hangs is knowing whether
or not the system is actually hung.  Sometimes it appears that the
system is hung when in fact only a particular application is hung.
To help determine whether the system is actually hung and to help
diagnose the problem, ask these questions:

Can you rlogin or telnet to the system?

Can you ping the system?

Does the mouse track in the window?

What changes have been made to the system recently?

How often do the hangs occur?

What are the circumstances under which the hang occurs?

Can the hang be reproduced on command?

What is necessary to get out of the hang (i.e. can the
machine be L1-A'd)?

If you are able to rlogin or telnet to the system, it's probable that
the system is OK but some application is hung (possibly OpenWindows or CDE).


CHECKING FOR A RESOURCE DEPRIVATION HANG
----------------------------------------

The most common cause of a system hang is that the system has run
out of resources.  The first step is normally to run some
performance tools to see whether that is in fact the case.

The following can be placed in a file and then invoked from cron
every 15 minutes to help determine if the system is CPU bound,
I/O bound, or memory bound.


#!/bin/sh

# Gather performance statistics to help determine whether the system
# is CPU bound, I/O bound, or memory bound.  Output is kept in a
# separate directory per day so the sysadmin can delete old data.
# set -x
DATESUFFIX=`date +%m%d`
OUTPUTDIR=/var/tmp/$DATESUFFIX

if [ ! -d "$OUTPUTDIR" ] ; then
    mkdir -p "$OUTPUTDIR"
fi

PATH=$PATH:/usr/sbin:/usr/bin ; export PATH

date >> $OUTPUTDIR/vmstat.out
vmstat 30 5 >> $OUTPUTDIR/vmstat.out

# The -n, -p, and -E options to iostat are not available before
# Solaris 2.6 (5.6).
OSREV=`uname -r`
case $OSREV in
    5.6* | 5.7* )
        IOARGS="-xnctpE"
        ;;
    * )
        IOARGS="-xct"
        ;;
esac
date >> $OUTPUTDIR/iostat.out
iostat $IOARGS 30 5 >> $OUTPUTDIR/iostat.out

date >> $OUTPUTDIR/ps.out
/usr/bin/ps -e -o pcpu,pmem,fname,rss,vsz,pid,stime >> $OUTPUTDIR/ps.out
date >> $OUTPUTDIR/ucbps.out
/usr/ucb/ps -aux >> $OUTPUTDIR/ucbps.out

# Kernel memory statistics (see the Kernel Memory section below).
date >> $OUTPUTDIR/kmastat.out
echo kmastat | crash >> $OUTPUTDIR/kmastat.out
date >> $OUTPUTDIR/kernelmap.out
echo "map kernelmap" | crash >> $OUTPUTDIR/kernelmap.out

date >> $OUTPUTDIR/uptime.out
uptime >> $OUTPUTDIR/uptime.out
date >> $OUTPUTDIR/netstat.out
netstat -i >> $OUTPUTDIR/netstat.out

date >> $OUTPUTDIR/du.out
du -s /tmp >> $OUTPUTDIR/du.out
date >> $OUTPUTDIR/ls.out
ls -lt /tmp >> $OUTPUTDIR/ls.out

date >> $OUTPUTDIR/mpstat.out
mpstat 30 5 >> $OUTPUTDIR/mpstat.out

This script separates its data into a directory per day; OUTPUTDIR can
still be modified if a different location is desired.
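As one way to schedule it (the path /opt/perfmon/perfstats.sh below is
only a placeholder; use wherever the script is actually saved, and make
it executable), a root crontab entry such as the following runs it
every 15 minutes:

```
# Add via "crontab -e" as root.
0,15,30,45 * * * * /bin/sh /opt/perfmon/perfstats.sh
```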

CPU Power
---------

In the vmstat command output, look at the run queue size (the r column,
first column).  If the run queue has more than 3 processes waiting per CPU
(i.e. more than 3 for a 1-CPU system, more than 6 for a 2-CPU system, etc.),
then this bears watching.  If the run queue has more than 5 processes
waiting per CPU, there is insufficient CPU power in the system.
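As a rough illustration only (the vmstat sample below is made up, and
the CPU count is hard-coded), a small awk filter can apply these
per-CPU thresholds to the saved vmstat output:

```shell
#!/bin/sh
# Sketch: flag vmstat samples whose run queue (r, first column)
# exceeds the per-CPU thresholds.  Sample data is illustrative.
NCPU=2    # on a live Solaris system: NCPU=`psrinfo | wc -l`

cat <<'EOF' > /tmp/vmstat.sample
 r b w   swap  free  re  mf pi po fr de sr in sy cs us sy id
 1 0 0  80000 12000   0   5  0  0  0  0  0 90 60 40  5  3 92
 9 0 0  78000 11000   0   7  0  0  0  0  0 95 70 45 60 10 30
EOF

awk -v ncpu="$NCPU" '$1 ~ /^[0-9]+$/ {
    if ($1 > 5 * ncpu)      print "r=" $1 ": insufficient CPU power"
    else if ($1 > 3 * ncpu) print "r=" $1 ": bears watching"
}' /tmp/vmstat.sample
```

On a 2-CPU system the sample with r=9 is flagged as bearing watching,
while r=1 passes silently.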

Virtual Memory
--------------

If, over time, the amount of memory specified in the swap column
goes down and does not recover, there is probably a memory leak on the
system.  To determine whether it is a kernel memory leak, look at the
output of the two crash commands (described later).  To determine which
application has the leak, watch the SZ column of the /usr/ucb/ps
output.  This column indicates the size of a process's data and stack
in kilobytes.

If, over time, the swap column goes down and recovers, look at the lowest
value it reaches.  If this value goes below 4000 Kbytes, the system
is in danger of running out of virtual memory space, and more swap space
should be added.
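The same 4000 Kbyte threshold can be checked against the "available"
figure reported by swap -s.  The sketch below parses a made-up sample
line; on a live system, pipe the real command output instead:

```shell
#!/bin/sh
# Sketch: warn when free virtual memory drops below 4000 Kbytes.
# The echo stands in for: swap -s
echo 'total: 60000k bytes allocated + 12000k reserved = 72000k used, 3500k available' |
awk '{
    free = $(NF - 1)        # next-to-last field, e.g. "3500k"
    sub(/k$/, "", free)
    if (free + 0 < 4000)
        print "WARNING: only " free "k of swap available -- add swap space"
}'
```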

Physical Memory
---------------

The sr (scan rate) column of the vmstat output indicates the rate at which
the page daemon scans memory looking for pages it can free for current
processes.  If this rate stays above 200 for prolonged periods of time, the
machine is short of physical memory and would benefit from additional
physical memory.
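Because the position of the sr column varies between vmstat versions,
it is safer to locate it by name from the header line.  A sketch, with
made-up sample data:

```shell
#!/bin/sh
# Sketch: find the sr column by name and flag samples over 200.
cat <<'EOF' > /tmp/vmstat.mem
 r b w   swap  free  re  mf pi po  fr de  sr  in sy cs us sy id
 0 0 0  80000 12000   0   5  0  0   0  0   0  90 60 40  5  3 92
 0 0 0  60000  2000   0  40  9 30  60  0 310 400 90 80 10 20 70
EOF

awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "sr") col = i; next }
     $col + 0 > 200 { print "scan rate " $col " -- short of physical memory" }' /tmp/vmstat.mem
```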

Kernel Memory
-------------

The kernel has a limited amount of memory which it uses for kernel
data allocations.  This memory is commonly referred to as the kernel
heap.  The maximum size of the kernel heap is fixed, depending on
machine architecture and the amount of physical memory.  If a machine
runs out of kernel memory, it will usually hang.  To see if this is
the case, look at the output of the crash kernelmap command.  This
command shows how many segments of kernel memory exist and how large
each segment is.  If there are only 1 and 2 page segments left, the kernel
has run out of memory (even if there are a hundred of them).

The crash kmastat command shows how much memory has been allocated to
which bucket.  Prior to Solaris 2.4, this showed only 3 buckets making
it difficult to tell which bucket was hogging the memory (if any).
Starting with 2.4, kmastat breaks memory allocation down to many
different buckets.  If one of these buckets has several MB of memory,
and there are kernel memory allocation failures, there is probably
a memory leak involving the large bucket.  If the bucket is 
kmem_alloc_8192, this is the buffer cache bucket.  In systems with
very large amounts of memory, the amount allowed in this bucket should
be tuned by adding "set bufhwm=8000" in the /etc/system file.
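To spot an oversized bucket quickly, the kmastat output can be filtered
with awk.  The column layout in the sample below is illustrative only;
check a real "echo kmastat | crash" dump to see which column holds
memory-in-use, and adjust the field number ($5 here) to match:

```shell
#!/bin/sh
# Sketch: pick out kmastat buckets holding more than 4 MB.
# Sample data and column layout are illustrative.
cat <<'EOF' > /tmp/kmastat.sample
cache name        bufsize  inuse  total mem_in_use  fail
kmem_alloc_256        256   1200   1400     358400     0
kmem_alloc_8192      8192   2100   2100   17203200     0
EOF

awk 'NR > 1 && $5 + 0 > 4194304 { print $1 " holds " $5 " bytes" }' /tmp/kmastat.sample
```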

In order to diagnose a memory leak problem it is possible to turn
on some flags in Solaris 2.4 and above (see SRDB 12172).  With
these flags turned on, once the kmastat command shows significant
growth in the offending bucket, L1-A should be used to stop the 
machine and create a core file.  SunService can use this core file
to help determine the cause of the leak.

Disk I/O
--------

To check for disks which are overly busy, look at the iostat output.  The
columns of interest are %b (% of time the disk is busy) and svc_t (average
service time in milliseconds).  If %b is greater than 20% and svc_t
is greater than 30ms, this is a busy disk.  If there are other disks which
are not busy, the load should be balanced.  If all disks are this busy,
additional disks should be considered.

There is no direct way to check for an overloaded SCSI bus, but if the %w
column (% of time transactions are waiting for service) is greater than 5%,
then the SCSI bus may be overloaded.
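These thresholds can be applied to saved iostat output with a short awk
filter.  The sample below is made up, and the field numbers assume the
classic extended layout (device r/s w/s Kr/s Kw/s wait actv svc_t %w %b);
adjust them if the iostat options used produce different columns:

```shell
#!/bin/sh
# Sketch: flag disks that are both busy (%b > 20) and slow (svc_t > 30 ms).
cat <<'EOF' > /tmp/iostat.sample
device   r/s  w/s  Kr/s  Kw/s wait actv svc_t  %w  %b
sd0      0.5  3.2   4.0  25.6  0.0  0.1  12.0   0   5
sd1     15.0 22.0 120.0 176.0  0.4  0.9  45.0   7  38
EOF

awk 'NR > 1 && $10 + 0 > 20 && $8 + 0 > 30 {
    print $1 ": " $10 "% busy, " $8 " ms service time -- overloaded"
}' /tmp/iostat.sample
```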

Information about what levels to check for the various performance statistics
is taken from "Sun Performance and Tuning" by Adrian Cockcroft,
ISBN 0-13-149642-5.

Additional performance gathering scripts are available in Infodoc 2242
for Solaris 2.x and Infodoc 11365 for SunOS 4.x.


GENERATING CORE FILES
---------------------

If looking at the performance statistics is not enough to diagnose the
problem, it is necessary to get a core file.  Infodoc 12031 describes
how to do this.

If it is not possible to get a core file, then the situation is called
a hard hang.  Contact SunService for information on diagnosing hard
hang situations.

Analyzing system hang core files
--------------------------------

Once a core file is obtained, the first information to look at is a
threadlist generated by the adb command.

$ adb -k unix.NUM vmcore.NUM | tee threadlist.NUM
physmem xxxxx
$<threadlist

Replace the value NUM in the above commands with the number of the
core files (e.g. unix.1, vmcore.1).  If this is a large system, go get
a cup of coffee while the threadlist runs (it can take up to 15 minutes).

Once the threadlist has been generated, use a text editor to look at
the resulting file.  An example of some threadlist output
follows (note that this example uses a modified version of the
threadlist macro):

                ============== thread_id        e0182000
p0:
p0:             process args=   sched
t0:
t0:             lwp             proc            wchan
                e01c9898        e01d6a48        0
t0+0x34:        sp              pc
                e0181ee8        sched+0x3f0
?(?) + effff090
main(0x0,0x3c,0x2,0xe01a7c00,0xe01d6a48,0xe01a7cb8)

                ============== thread_id        e0e81ec0
p0:
p0:             process args=   sched
0xe0e81ec0:     lwp             proc            wchan
                0               e01d6a48        0
0xe0e81ef4:     sp              pc
                e0e9fec0        poll_obp_mbox+0xbc
?(?) + effff090
poll_obp_mbox(0xfc,0xfc,0x0,0x742c,0xffffffff,0xe01c5e5c)
level14_handler(0xe0e81d54) + 4
L14_front_end(?) + 68
splclock(0x404000e6)
disp_getwork(0xe01c4a48,0x404000e6,0x0,0xffffffff,0xe10baec0,0xf7469030) + c
idle(0xe01c4a48,0x0,0xe01c4a4c,0xe01c4b64,0x0,0x0) + cc

                ============== thread_id        e0ea2ec0
p0:
p0:             process args=   sched
0xe0ea2ec0:     lwp             proc            wchan
                0               e01d6a48        0
0xe0ea2ef4:     sp              pc
                e0ea2610        complete_panic+0xd0
?() + effff090
data address not found

This output has a brief description of information from the thread
structure followed by a stack trace for each thread in the system.  Threads
whose stack traces show "data address not found" have typically been
swapped out.  The unmodified threadlist macro shipped with the system gives
only the thread id and the stack trace.

Looking through this threadlist, look for stack traces which show
mutex_enter, rw_enter, or biowait as the top routine.  Also look for
threads trying to do kernel memory allocation (kmem_alloc*).
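For a quick first pass over a large threadlist, simply counting how
often each suspicious routine appears can show where the threads are
piling up.  The sample file below is made up; substitute the
threadlist.NUM file produced by the adb step above:

```shell
#!/bin/sh
# Sketch: count suspicious routines in a saved threadlist.
cat <<'EOF' > /tmp/threadlist.sample
mutex_enter(0xe01c4a48,0x0) + 8
biowait(0xe10baec0) + 4
mutex_enter(0xe01c4a48,0x0) + 8
EOF

for routine in mutex_enter rw_enter biowait kmem_alloc; do
    echo "$routine: `grep -c $routine /tmp/threadlist.sample`"
done
```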

If there are many threads waiting on the same mutex (look at wchan), see
what thread owns the mutex and what it's doing.  Follow this trail to
see why the machine may be hung.

If there are many threads trying to do kmem_alloc, there is probably
a lack of kernel memory.  See the kernel memory section above for
information about setting kernel memory flags.  Once a corefile with
kernel memory flags set is obtained, check with SunService on how to
proceed.

If there are many threads waiting on biowait, check the buffers being
waited for to see if they are all doing I/O to the same disk.  Maybe
a controller is hung or not operating correctly.

Product Area: Kernel
Product:      hang
OS:           Solaris 2.x
Hardware:     any

Sun Proprietary/Confidential: Internal Use Only