SunSolve Internal

Infodoc ID:  14133
Synopsis:    kernel tips: Is system crash due to hardware or software?
Date:        8 Jan 1997

Description

Your system crashed.  Is it hardware or software?  Logistics and logfiles
provide the clues.

The circumstances under which (or after which) the system started to crash
are important in determining whether the problem is hardware related.
Hardware problems tend to be more random, to change in frequency, and to
start out of nowhere.  Software problems tend to be more predictable and
methodical.

Errors logged in the /var/adm/messages file, in dmesg output, and perhaps in
other logfiles as well, help pinpoint what was going on when the crash
occurred.  Suspect hardware if hardware errors appear in the logfiles.
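
For example, a quick first pass over the logs can be made with standard
utilities; the error strings below are only a starting point, since the exact
wording varies by platform, driver, and device:

    # dmesg | egrep -i 'error|fault|warning'
    # egrep -i 'error|fault|warning|panic' /var/adm/messages | more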

A.	Logistics of hardware failures:

1.	When the system started crashing.

Hardware problems often occur after some trauma to the hardware.  This
includes power failures, hardware modifications, hardware additions, and
improper handling.  A problem that surfaces after no hardware or software
changes is also a good bet to be hardware-related.

a.	The system started crashing after power failures.

Power failures and the related power surges that happen when power is
restored can zap hardware.  While new hardware is not likely to see trouble
after a power blackout, each time a system is turned on or off it wears out
a little bit.

The stress on hardware of cooling down while it is left powered off for a
while, and of heating up again when it is powered back on, also contributes
to wearing it out.  It is better to leave a system on all of the time than to
turn it on and off.

Older systems are more susceptible to this type of problem.  They have been
in use longer, are more worn out, and are thus weaker.  What a new system
might endure easily could be the last straw for a weaker older system.

b.	The system started crashing after hardware modifications.

System trouble after modifying hardware is most likely a hardware problem
because it is the hardware that has been changed.  The issue could be
something as simple as an improperly seated board, so try reseating the boards
and cables before doing anything else.  Incompatible hardware could have been
added to the system, rendering the whole system unusable.  Remove all newly
added hardware from the system and restore the old hardware, if possible, to 
see if the problem goes away.
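
As a sanity check after a hardware change, prtconf(1M) can confirm that the
system actually sees the boards and devices it is supposed to; if possible,
compare the output to a listing taken before the change (output format varies
by platform and OS release):

    # prtconf | more          (devices the kernel has attached)
    # prtconf -p | more       (devices as reported by the PROM)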

An improperly grounded technician touching the hardware while modifying it
can zap it with static and break it.  This sort of problem might not be as
obvious, though, because the hardware might still function for a while, 
or might die right away, depending on the extent of the damage.

c.	The system started crashing after being dropped.

Suffice it to say that no computer hardware is indestructible.

d.	The system started crashing out of the blue, after not having been
modified at all.  It had been running for a long time before this crash, and
now it crashes frequently.

"Software rot" is not likely on a Unix system.  Chances are good that if a
system starts crashing out of the blue, the problem is that a piece of
hardware wore out.

2.	How the system crashes.

The frequency, consistency, and timing of crashes are telltale signs of
whether the problem is hardware or software related.  Randomness, increasing
frequency, or a correlation to temperature or other conditions is a sign of
hardware problems.

a.	The system crashes in random ways.

Sometimes it hangs, sometimes it panics, sometimes the screen goes black;
it's different every time.  When it panics, it panics in different places
with different tracebacks.  (The nature of the panic will probably be a
data fault or some other exception condition, most likely induced by a stuck
bit in the CPU, a bad spot in memory, or some other hardware problem.)

b.	The system crashes more and more frequently with each passing day.

Hardware problems will begin to occur more and more frequently as time goes
on, as the hardware becomes less and less "marginal" and more and more
broken (i.e. worn out).

c.	The system tends to crash after it's been on for a while; or the
system tends to crash within a few minutes of it being powered on, but then
stays up after it is warmed up.

Hardware problems can change character as the system temperature does.
Marginal connections can break when the system is used under conditions more
"abnormal" than those it is used to.  For example, a system might fail memory
diagnostics when it has just been powered on, but might pass them after having
been left on for a while.

d.	The system crashes when doing any of a wide variety of things.

One coredump says a panic occurred while ufs was doing its thing.  Another
coredump says a panic occurred while transferring something over the ethernet.
Yet another says a panic occurred while the system was running idle.  If there
are no routines in common among the crashes, suspect hardware.
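
One way to check for routines in common is to pull the panic traceback out of
each saved crash dump with adb(1).  This sketch assumes savecore(1M) is
enabled and the dumps are in the usual /var/crash/`hostname` directory; the
dump numbers are only examples:

    # cd /var/crash/`hostname`
    # adb -k unix.0 vmcore.0
    $c                        (print the panic stack traceback)
    $q
    # adb -k unix.1 vmcore.1
    $c
    $q

If the tracebacks share no routines, suspect hardware; if the same routine
shows up every time, suspect that routine.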

e.	The system did a watchdog reset.

Watchdog resets are usually hardware related.  They occur when a processor
gets a second trap while in the middle of initial processing of a first trap,
during the period when trap handling is disabled.  During this period, the
system does not know what to do with the second trap, so it just stops.
Software can cause this, but most Solaris bugs in this vein have been fixed.
A problematic piece of hardware, on the other hand, could cause spurious traps
to be sent, some of them arriving in the middle of processing legitimate traps.

f.	The system leaves no messages when it reboots.

This could be a watchdog reset on a system with its obprom "watchdog-reboot?"
flag set to true.
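
The flag can be checked from a running system with eeprom(1M), or from the
ok prompt with printenv; the default value varies by PROM revision:

    # eeprom 'watchdog-reboot?'
    watchdog-reboot?=false

    ok printenv watchdog-reboot?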

g.	The system panicked with an "asynchronous memory error".

This indicates a memory problem.

h.	The system panicked with an "asynchronous memory fault".

This indicates a problem somewhere along the path between the CPU, the
motherboard, and the memory.  The CPU was likely writing out its cache to main
memory when the fault occurred.  Look for other signs of problems to help
narrow down the faulty part.
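
On platforms that provide it, prtdiag(1M) can show which memory banks or
system boards have logged errors, which helps narrow down the faulty part
(the path and output vary by platform):

    # /usr/platform/`uname -i`/sbin/prtdiag -v | more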

B.	Logistics of software problems.

The main trait of a software problem is consistency.  Software problems are
usually reproducible, occurring when the same piece of software is run under
the same circumstances, and they start for a reason (e.g. a change to
software, as opposed to "out of the blue," which is a trait of a hardware
problem).

1.	When the system started crashing.

If a new piece of software was added, or some software was changed, just
before the crashing started, the new or changed software is suspect.  The
problem could be hardware-related in this case only if the new software
exercises hardware in new ways.

If using existing software in new ways causes a crash which can be repeated,
that software is suspect.
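
To see what software changed recently, the package and patch databases are a
reasonable starting point; note that the dates shown are install dates, and
locally built software will not appear here:

    # pkginfo -l | egrep 'PKGINST|INSTDATE'
    # showrev -p | more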

2.	How the system crashes.

If a system crashes in the same way every time, or references the same
routines in the panic stacktrace each time, the routines common to all of the
stacktraces are suspect.
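
The panic strings and tracebacks from earlier crashes are usually preserved
in the rotated messages files, so a quick comparison can be made without
loading every crash dump (file names assume the default syslog rotation):

    # cd /var/adm
    # grep -i panic messages messages.[0-9]* | more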

Systems that hang consistently while doing a particular operation could be
either hardware or software.  If the system hangs because it has run out of
memory, the problem is software (a memory leak).  A crash / map kernelmap can
show whether or not the system is out of kernelmap memory, and a crash /
kmastat (or, better yet, turning on the kmem_debug flags (see SRDB 12172
about this) and using crash / kmausers) can help narrow down which driver is
hogging the memory.
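
A sketch of such a session with crash(1M), using the commands named above
(run here against the live kernel; a saved dump can be examined by pointing
crash at the dump and namelist files instead):

    # crash
    > map kernelmap           (how much kernelmap is left)
    > kmastat                 (kernel memory allocator statistics)
    > kmausers                (biggest consumers; needs the kmem_debug
                               flags, see SRDB 12172)
    > q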

C.	How to tell what is wrong.

Whether the problem is hardware or software related, the key to telling what
is wrong is to look for things in common among several crashes.  A system's
ethernet hardware may be suspect when, out of the blue, the system starts
experiencing hangs and panics when trying to do nfs or when trying to open a
socket to another system.  What do these things have in common?  The network.
The fact that it started "out of the blue" indicates that hardware is suspect.
Another example:  a system panics while writing a file to a ufs file system,
and panics again while trying to open a file.  The problem has been occurring
since the latest version of the OS was installed.  The ufs file system code
is suspect, because ufs routines will be on the stacktrace of both panics.

D.	Logfiles are a big help in determining problems.

The /var/adm/messages file is where errors get logged.  Many a panic or
problem condition is prefaced by messages in the messages file.  If there are
panics with ufs (such as "freeing free block"), and there are disk errors in
the messages file, suspect the hardware before the software.  The software
cannot work without cooperative hardware.  If there are messages that the
system is running out of memory before a hang, there may also be messages
from the routines needing the memory which give a clue as to what is hogging
it.
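
For example, before blaming ufs for a "freeing free block" panic, look for
disk errors logged shortly before the panic; the driver names to look for
(sd, esp, isp, and so on) depend on the hardware:

    # egrep -i 'panic|error' /var/adm/messages | more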

Product Area:  Kernel
Product:       crash
OS:            any
Hardware:      any

Sun Proprietary/Confidential: Internal Use Only