Info Docs Article 14138

SunSolve Internal
Infodoc ID Synopsis Date

14138 kernel tips: Use of the ps command in corefile analysis 29 Apr 1997

Description
The ps command can help in corefile analysis.  It is particularly useful in
hang analysis, as it can give a high-level snapshot of the system.  It displays
which processes are running, which are waiting, and what the waiting processes
are waiting for.  It can also give an indication of the history of what caused
the system to hang.

Specify the -efl option to ps when obtaining output for corefile analysis
purposes.  The -e option tells ps to print information for every process on
the system, which is key to getting the full picture of what is happening.
The -f and -l options print all information relevant to system analysis;
without them the amount of information provided by ps is too little.

o	How to tell whether a system is really hung.

A system which is hung in the truest sense will have no processes running.
Look at the "S" or state column to view the state of processes.  A system will
have as many processes running as it has processors, assuming there are at
least that many waiting for CPU service.  Running processes have a state of
"O" for "on processor."  A system having no "O" processes is hung.

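The state check above can be sketched in a few lines of Python.  This is only
an illustration: the ps excerpt is invented, and the parsing assumes that the
state letter is always the second whitespace-separated field of a "ps -efl"
line.

```python
from collections import Counter

# Hypothetical "ps -efl" excerpt; every value here is invented for illustration.
SAMPLE = """\
 F S      UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN    STIME TTY      TIME CMD
19 T     root     0     0  0   0 SY f0274e38      0          Oct 07 ?        0:01 sched
 8 S     root     1     0  0  41 20 f5af4888    162 f5af4a80   Oct 07 ?        0:01 /etc/init -
 8 S     root   140     1  0  41 20 f5b0c008    340 f5b1d2c6   Oct 07 ?        0:00 /usr/sbin/inetd -s
 8 S    alice   412   140  0  51 20 f5b2a690    210 f5b1d2c6 09:14:02 pts/1    0:00 -ksh
"""

def state_counts(ps_text):
    """Count processes per state letter (second field of each data line)."""
    counts = Counter()
    for line in ps_text.splitlines()[1:]:      # skip the header line
        fields = line.split()
        if len(fields) > 1:
            counts[fields[1]] += 1
    return counts

def no_process_on_cpu(ps_text):
    """Apply the rule from the text: no process in state O suggests a hang."""
    return state_counts(ps_text)["O"] == 0

print(dict(state_counts(SAMPLE)))
print("no process on a processor:", no_process_on_cpu(SAMPLE))
```
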
Hung systems usually have many processes waiting on a particular thing, or have
processes waiting for something that another process, itself waiting for
something else, has.  Processes waiting for something are usually sleeping,
and show a state of "S."  The WCHAN column gives a unique value for the item
being waited for.  Processes listed by ps as waiting for the same thing will
show the same value in their WCHAN column.
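
Grouping sleeping processes by WCHAN makes shared wait targets stand out.  The
sketch below uses an invented ps excerpt and assumes that a WCHAN value, when
present, is the eleventh field and looks like a lowercase hex address; real
output may need more careful column handling.

```python
import re
from collections import defaultdict

# Hypothetical "ps -efl" excerpt; every value here is invented for illustration.
SAMPLE = """\
 F S      UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN    STIME TTY      TIME CMD
 8 S     root     1     0  0  41 20 f5af4888    162 f5af4a80   Oct 07 ?        0:01 /etc/init -
 8 S     root   140     1  0  41 20 f5b0c008    340 f5b1d2c6   Oct 07 ?        0:00 /usr/sbin/inetd -s
 8 S    alice   412   140  0  51 20 f5b2a690    210 f5b1d2c6 09:14:02 pts/1    0:00 -ksh
"""

# Assumption: a WCHAN value is 8 to 16 lowercase hex digits.
HEX_ADDR = re.compile(r"^[0-9a-f]{8,16}$")

def waiters_by_wchan(ps_text):
    """Map each WCHAN value to the PIDs of the sleeping processes showing it."""
    groups = defaultdict(list)
    for line in ps_text.splitlines()[1:]:      # skip the header line
        f = line.split()
        # Field order: F S UID PID PPID C PRI NI ADDR SZ [WCHAN] ...
        if len(f) > 10 and f[1] == "S" and HEX_ADDR.match(f[10]):
            groups[f[10]].append(int(f[3]))
    return dict(groups)

# List the most contended wait channels first.
for wchan, pids in sorted(waiters_by_wchan(SAMPLE).items(),
                          key=lambda item: -len(item[1])):
    print(wchan, pids)
```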

The WCHAN column can list different things to wait for.  Condition variables,
mutexes and read/write locks are the most common.  One must examine the process
with adb to tell what it is waiting for.

The ADDR field lists the address of the process's proc structure, which is the
most basic per-process data structure.  Anything about a process can be found
by starting with the proc structure;  either the data is there, or it can be
found by dumping other structures pointed to by that structure.

Dump out the threads of a process to find out what they are waiting for.
Pass the value in the ADDR field to the adb command macro $<proc, then get
the tlist value from the output and pass that to $<thread.  Get the sp value
from the output and pass that to $c.  The top routine on the stack will most
likely be cv_wait_* (in which case the process is waiting for a condition
variable) or mutex_*_enter (in which case the process is waiting for a
mutex).  The commands to do these things are:

  ADDR$<proc
    where ADDR is from ps, to get the proc structure.
  tlist_value$<thread
    where tlist_value is from proc output, to get the thread structure.
  sp$c
    where sp is the sp value from the thread output, to get the stacktrace.

Once the item is determined, it can be dumped.  If, for example, the top
routine in the stacktrace is mutex_adaptive_enter, the item referred to by
WCHAN is a mutex, and can be dumped by

  WCHAN_VALUE$<mutex

In this case, the mutex macro will print the owner thread of the mutex, and
that thread can then be investigated.  What is it waiting on? Another mutex?
Is there a deadlock condition?  ... and so on.

Multithreaded processes have a valid address of another thread in the "next"
field of the thread macro output.  Locate the "next" value from the output of

  tlist_value$<thread

and pass that value to the thread macro as well:

  new_tlist_value$<thread

then dump its stack, etc.
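
Walking the thread list is repetitive, so it can help to keep track of which
addresses have already been dumped.  The sketch below builds the adb commands
to type for each thread.  It assumes, without confirmation from the macro
output, that the list either ends with a "next" of 0 or loops back to a thread
already visited; the next_of table and its addresses are hypothetical values
copied by hand from earlier $<thread output.

```python
def thread_commands(tlist, next_of):
    """Return the adb commands for each thread reachable from tlist.

    next_of maps a thread address to the "next" value seen in its
    $<thread output.  The walk stops at 0, at an unknown address, or
    when it reaches a thread that was already visited.
    """
    commands = []
    seen = set()
    addr = tlist
    while addr not in (None, "0") and addr not in seen:
        seen.add(addr)
        commands.append(addr + "$<thread")
        addr = next_of.get(addr)
    return commands

# Invented addresses: two threads whose "next" fields point at each other.
next_of = {"f5b3c200": "f5b3c400", "f5b3c400": "f5b3c200"}

for command in thread_commands("f5b3c200", next_of):
    print(command)
```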

o	A system not hung in the truest sense.

If there are processes with a state of "O," the system is not hung.  Check the
amount of CPU time the "O" processes have used.  If it is far out of
proportion to that of other processes, they may be hogging the system and
slowing it down.  Be aware, though, that some processes, such as "fsflush,"
run from boot time and normally rack up considerable CPU time.  Divide TIME by
the time elapsed since STIME to get a rough fraction of CPU time used, which
is a better gauge.
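
As a worked example of the rough CPU-share calculation, the sketch below
divides the accumulated CPU time (the TIME column) by the wall-clock time since
the process started (derived from STIME and the time of the dump).  The elapsed
time is passed in as seconds because STIME may show only a date; the function
names are illustrative, and TIME values are assumed to be colon-separated
(mm:ss or h:mm:ss).

```python
def cpu_seconds(time_field):
    """Convert a colon-separated TIME value such as "12:30" to seconds."""
    seconds = 0
    for part in time_field.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def cpu_percent(time_field, elapsed_seconds):
    """CPU time used as a percentage of the wall-clock time since start."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return 100.0 * cpu_seconds(time_field) / elapsed_seconds

# A process with 30 minutes of CPU time that started one hour before the dump.
print(cpu_percent("30:00", 3600))

# A long-running daemon: 45 minutes of CPU time over ten days of uptime.
print(cpu_percent("45:00", 10 * 24 * 3600))
```

By this measure, a process with a large TIME value accumulated over many days
can turn out to be far less of a CPU consumer than a recently started process
with a smaller TIME value.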

The sheer number of processes can tell how loaded a system may be.  A system
with hundreds of processes will run slower than one with only a few processes.
The number of processes running on the system can indicate that there may be
other problems too:  could there be a memory shortage due to a large number of
processes?  Is the page daemon working overtime and the system thrashing
trying to juggle enough memory to keep all those processes afloat and
operational?

Lots of zombie processes (state "Z") may indicate other problems.  These
processes were not cleaned up when they were supposed to have been.  Why?
Perhaps there is a kernel problem of some kind.
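
A zombie remains in the process table until its parent collects its exit
status, so grouping zombies by parent PID may show which parent to examine
first.  The sketch below uses an invented ps excerpt and assumes that the
first five fields (F, S, UID, PID, PPID) are present on zombie lines even
though later columns are blank.

```python
from collections import defaultdict

# Hypothetical "ps -efl" excerpt; every value here is invented for illustration.
SAMPLE = """\
 F S      UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN    STIME TTY      TIME CMD
 8 S     root   140     1  0  41 20 f5b0c008    340 f5b1d2c6   Oct 07 ?        0:00 /usr/sbin/inetd -s
 8 Z     root   601   140  0   0                                            0:00 <defunct>
 8 Z     root   602   140  0   0                                            0:00 <defunct>
 8 Z    alice   733   412  0   0                                            0:00 <defunct>
"""

def zombies_by_parent(ps_text):
    """Map each parent PID to the PIDs of its zombie (state Z) children."""
    groups = defaultdict(list)
    for line in ps_text.splitlines()[1:]:      # skip the header line
        f = line.split()
        if len(f) >= 5 and f[1] == "Z":
            groups[int(f[4])].append(int(f[3]))
    return dict(groups)

for ppid, zombies in zombies_by_parent(SAMPLE).items():
    print("parent", ppid, "has", len(zombies), "zombie(s):", zombies)
```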

Output of ps can be used to spot missing processes.  A system could appear
hung in a certain area (say, networking) because a certain key process is
missing.  For example, if statd is missing (i.e., it does not show up in
"ps -efl" output), NFS won't work.  This could make the system look hung to
processes trying to read files from an NFS server, even though it may appear
normal to a user logged in locally.
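
Checking for missing processes can be done mechanically against a list of
daemons that should be present.  In the sketch below the expected list is a
hypothetical example, the ps excerpt is invented, and the command name is
taken to be the field that follows the TIME column (assumed to look like
m:ss).

```python
import posixpath
import re

# Hypothetical "ps -efl" excerpt; every value here is invented for illustration.
SAMPLE = """\
 F S      UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN    STIME TTY      TIME CMD
 8 S     root     1     0  0  41 20 f5af4888    162 f5af4a80   Oct 07 ?        0:01 /etc/init -
 8 S     root   140     1  0  41 20 f5b0c008    340 f5b1d2c6   Oct 07 ?        0:00 /usr/sbin/inetd -s
"""

# Assumption: TIME has exactly one colon, which distinguishes it from
# an STIME value such as 09:14:02.
TIME_FIELD = re.compile(r"^\d+:\d\d$")

def running_commands(ps_text):
    """Return the base names of the commands listed in the CMD column."""
    names = set()
    for line in ps_text.splitlines()[1:]:      # skip the header line
        fields = line.split()
        for i, token in enumerate(fields):
            if TIME_FIELD.match(token) and i + 1 < len(fields):
                names.add(posixpath.basename(fields[i + 1]))
                break
    return names

def missing_processes(ps_text, expected):
    """Return the expected command names that do not appear in the listing."""
    return sorted(set(expected) - running_commands(ps_text))

# Hypothetical list of daemons this particular system is expected to run.
print(missing_processes(SAMPLE, ["statd", "inetd", "init"]))
```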

o	Running out of swap?

The number of memory pages (swap and resident) used by a process is indicated
by the SZ column of /usr/bin/ps.  If a particular process has an inordinately
large SZ value, it is consuming an inordinate amount of virtual memory.  Check
whether the system is running out of swap by running the "wsinfo" command and
looking at the "virtual memory" meter, or by running "swap -l" and looking at
the free blocks.
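
To put these numbers in familiar units, the sketch below converts an SZ value
from pages to kilobytes and totals the free space reported by "swap -l".  The
page size is an assumption that depends on the hardware (4096 bytes is used
here only as an example), the swap excerpt is invented, and the blocks are
taken to be 512-byte blocks.

```python
# Hypothetical "swap -l" output; the device and numbers are invented.
SWAP_L = """\
swapfile             dev  swaplo blocks   free
/dev/dsk/c0t3d0s1   32,25      8 263080 113400
"""

BLOCK_BYTES = 512       # assumed size of the blocks reported by swap -l

def sz_to_kbytes(sz_pages, page_bytes=4096):
    """Convert a ps SZ value (pages) to kilobytes for an assumed page size."""
    return sz_pages * page_bytes // 1024

def swap_totals(swap_l_text):
    """Return (total_blocks, free_blocks) summed over all swap areas."""
    total = free = 0
    for line in swap_l_text.splitlines()[1:]:  # skip the header line
        f = line.split()
        if len(f) >= 5:
            total += int(f[3])
            free += int(f[4])
    return total, free

total, free = swap_totals(SWAP_L)
print("SZ 162 pages =", sz_to_kbytes(162), "KB")
print("swap free: %.1f MB (%.0f%% of total)"
      % (free * BLOCK_BYTES / 1048576.0, 100.0 * free / total))
```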

o	A word about /usr/ucb/ps.

This is a Berkeley flavor of the ps command, and provides one bit of useful
information not provided by the /usr/bin/ps command. When run with the "-aux"
option, the command displays the resident set size, or the amount of main
memory (not swap) used by a process in kilobytes; and the percentage of total
main memory used.

This article spotlights the /usr/bin/ps command because it provides the proc
addresses (ADDR column), which are used in the analysis examples above.
Product Area	Kernel
Product	crash
OS	Solaris 2.x
Hardware	any
Sun Proprietary/Confidential: Internal Use Only