SunSolve Internal

Infodoc ID   Synopsis   Date
17152   watchdog FAQ   19 Aug 1998

Description

watchdog FAQ

Who is this document for:

This document is intended for customers whose systems are dropping
to the ok prompt because of a watchdog reset or for some other reason.

What is the purpose of the document:

This document describes how to collect the information that Sun
Customer Services needs in order to diagnose the reason for the
system behavior.  The document mentions infodocs and SRDBs, which
are available to contract customers via the SunSolve web site:
http://sunsolve.sun.com/
Technical Support Engineers at Sun can also send any of the documents
you request.

What is a watchdog reset?

A watchdog reset is an unrecoverable condition that forces the CPU
to reset.  It occurs when the machine takes a trap while it is already
handling a trap, at a point where the "Enable Traps" bit in the
Processor Status Register (PSR) is cleared.  Traps are disabled at
that point because no other trap should occur until the first trap
has been handled.  When a second trap occurs anyway, the CPU cannot
handle it, and the machine resets.
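
To make the condition concrete, the following is a minimal sketch that
reports the state of the Enable Traps bit in a raw PSR value, such as one
copied from register output.  It assumes the SPARC V8 layout, in which
Enable Traps (ET) is bit 5 of the PSR.  The name psr_et is a hypothetical
helper, not a Sun-supplied command.

```shell
# psr_et: print the ET (Enable Traps) bit of a SPARC V8 PSR value.
# Assumption: ET is bit 5 of the PSR (SPARC V8 layout).
# Hypothetical helper for illustration only.
psr_et() {
    # $1 is the PSR value, hex (0xc0) or decimal (192).
    # Prints 1 if traps are enabled, 0 if traps are disabled.
    echo $(( ($1 >> 5) & 1 ))
}
```

For example, psr_et 0xc0 reports that traps are disabled, which is the
state in which a further trap leads to a watchdog reset.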

Are there any other reasons that a system would drop to the ok prompt?

There are several other reasons.  First, if the system receives a
break via the console (because Stop-A was typed, because the keyboard
was unplugged and replugged on a regular console, or because a break
was sent from a tty console), it will halt and produce the ok prompt.
We recommend sending a break in this way to a hung (unresponsive)
system.  Second, a kernel feature known as a deadman timer can be
enabled in an effort to diagnose a hung system.  If it is enabled,
the system drops to the ok prompt when it hangs.

Is a watchdog reset the same as a system panic?

No.  On a system panic, the system saves the kernel context to the
system's swap disk, and then sets a flag indicating there is a crash
dump before it reboots.  If savecore is enabled, a crash dump is recovered
during reboot, or a manual savecore can be run shortly after the reboot.
On a watchdog reset, minimal information is saved, and then the system
simply halts.

What happens when a system gets a watchdog?

The behavior of the system after a watchdog reset is determined by the
value of the watchdog-reboot? prom variable.  To see the value on a
running system, use the eeprom command.  From the ok prompt, use the
printenv command.  The default value (shown here as output from the
eeprom command) is false:
watchdog-reboot?=false
With this value, the system stays at the ok prompt after a watchdog
reset.  If watchdog-reboot? is set to true, the system reboots
automatically.  If a system is rebooting for no discernible reason, we
advise checking the value of this parameter and setting it to false if
it is true.  If the system has been experiencing watchdog resets, this
will allow useful data to be collected the next time one happens.
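
The check described above can be sketched in shell.  The function name
watchdog_reboot_value is a hypothetical helper that only parses eeprom-style
output.  The eeprom invocations are shown as comments because they need a
live system and root privileges, and the quoting is there to keep the shell
from treating the question mark as a wildcard.

```shell
# watchdog_reboot_value: read eeprom-style output on stdin and print the
# value of watchdog-reboot? (true or false).  Hypothetical helper.
watchdog_reboot_value() {
    # In a basic regular expression the "?" is an ordinary character.
    sed -n 's/^watchdog-reboot?=//p'
}

# On a live system (as root), the commands would be along these lines:
#   eeprom 'watchdog-reboot?' | watchdog_reboot_value   # show current value
#   eeprom 'watchdog-reboot?=false'    # stay at the ok prompt after a watchdog
```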

Is the procedure for dealing with watchdogs the same for all Sun systems?

No.  Some of the commands will work on all systems, and others are
only relevant to certain architectures and configurations.  You
can determine the architecture of a running system by using the
command uname -a and observing the fifth field returned, which
should be sun4, sun4d, sun4m, sun4u, etc.  You can determine whether
a system is a multi-processor (MP) system by using the mpstat
command.  If it returns just one line of CPU data (not counting the
column header), it is a single-CPU system.  Otherwise, it is a
multi-processor system with a CPU represented by each line of
output.  An MP system will include in the prom prompt an indication
of which CPU experienced the halt, for example <#2>, which indicates
cpu2.  Please write down the number, as it can be helpful in
identifying which CPU to replace if the cause is found to be a
defective one.
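
The two checks above can be sketched as small shell filters.  Both function
names are hypothetical helpers.  The CPU count assumes that mpstat prints a
column header whose first field is CPU, followed by one line per processor.

```shell
# arch_from_uname: print the fifth field of `uname -a` output read on stdin.
# Hypothetical helper.
arch_from_uname() {
    awk '{ print $5 }'
}

# cpu_count_from_mpstat: count the per-CPU lines in mpstat output read on
# stdin.  Assumption: header lines start with the word "CPU".
cpu_count_from_mpstat() {
    awk '$1 != "CPU" && NF > 0 { n++ } END { print n + 0 }'
}

# On a live system:
#   uname -a | arch_from_uname         # e.g. sun4u
#   mpstat | cpu_count_from_mpstat     # 1 means a single-CPU system
```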

Is there any way to tell for certain if a watchdog reset has occurred?

Systems with a sun4d or sun4u architecture have a command called prtdiag.
It is usually not in the default command path, so if you do not know
where to find it, run man prtdiag to find the path.  prtdiag -v
displays configuration data, followed by the time of the last watchdog
reset if one has occurred.
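
If the man page is not at hand, a search can locate the command.  The
sketch below assumes that prtdiag is installed somewhere under
/usr/platform, which is not stated in this document, so treat the man page
as the authority.  The name find_prtdiag is a hypothetical helper.

```shell
# find_prtdiag: list files named prtdiag under a directory tree.
# Assumption: prtdiag lives under /usr/platform on sun4d/sun4u systems.
# Hypothetical helper; pass another directory as $1 to search elsewhere.
find_prtdiag() {
    root=${1:-/usr/platform}
    find "$root" -name prtdiag -print 2>/dev/null
}

# On a live system:
#   p=$(find_prtdiag | head -n 1) && "$p" -v
```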

What should be done when the system has dropped to the ok prompt?

Some commands should be run to capture the state of the system, and
then the sync command should be used to force a panic and
create a crash dump on the reboot.  Since the system is not running
when it is at the ok prompt, the output of the commands will not be
saved.  You must write down the results, or use a tip session
connected to the serial console port to capture them.  Infodoc 15085
explains how to set up a tip session.
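
Once a session has been captured to a file, the CPU number shown in an MP
prom prompt can be pulled out of the log.  This is a sketch only: the name
halted_cpus_from_log, the log file path, and the use of script with
tip hardwire to record the session are assumptions, and infodoc 15085
remains the reference for the actual tip setup.

```shell
# halted_cpus_from_log: read a captured console log on stdin and print the
# CPU numbers found in MP-style prom prompts such as "<#2>".
# Hypothetical helper.
halted_cpus_from_log() {
    sed -n 's/^.*<#\([0-9][0-9]*\)>.*$/\1/p' | sort -un
}

# Capturing and then scanning a session might look like this (assumed):
#   script /var/tmp/watchdog.log    # start recording the terminal
#   tip hardwire                    # connect to the serial console
#   ... run the ok prompt commands, then leave tip and script ...
#   halted_cpus_from_log < /var/tmp/watchdog.log
```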

What commands should be typed from the ok prompt?

The commands are described below.  There is a feature called obpsym
which, when enabled, allows some of the commands to provide symbolic
information that makes interpretation by Sun Customer Services
easier (and probably faster).  If you do not know how to enable it,
ask someone at Sun to send you internal infodoc 15876.

Commands that work on all systems:

.registers   This displays the internal registers of the current cpu.
.locals      This displays the registers in the current register window.
ctrace       This displays the kernel stack.  If obpsym is enabled,
             the output includes useful symbolic information.  If not,
             it produces numbers which must be interpreted in conjunction
             with a crash dump.  This is the single most useful command.

System-specific commands:

.psr         Only available on systems supporting SPARC V8 architecture.
             If you're not certain, try it.  Prints the Processor
             Status Register in a readable format.
wd-dump      Only available on sun4d architecture.  Displays watchdog
             data including the program counter of the instruction that
             caused the crash.

What should be done after the commands are typed, and the results
recorded?

Type sync, which should cause a panic and a reboot.  When the system
has rebooted, check for a crash dump.  If savecore was enabled, the
path in /etc/init.d/sysetup should say where to find the corefiles.
If savecore was not enabled, manually run savecore -v <directory-name>
to produce the corefiles in that directory.  This must be done before
system activity causes the data on the swap disk to be overwritten.
For details on savecore configuration, see infodoc 12031 (for Solaris
2.x only), infodoc 11827 (for SunOS 4.x only), and infodoc 14230 (for
both operating systems).  After savecore has been run, contact Sun
Customer Services to have someone collect the data and analyze it.
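
The manual savecore step can be followed by a quick check that corefiles
were actually produced.  The sketch below assumes that savecore writes
numbered pairs named unix.N and vmcore.N.  The name list_corefiles is a
hypothetical helper, and the directory in the comments is only an example.

```shell
# list_corefiles: print crash dump files found in a savecore directory.
# Assumption: savecore writes numbered pairs named unix.N and vmcore.N.
# Hypothetical helper.
list_corefiles() {
    dir=$1
    for f in "$dir"/unix.* "$dir"/vmcore.*; do
        # An unmatched pattern stays literal, so test that the file exists.
        [ -f "$f" ] && echo "$f"
    done
    return 0
}

# On a live system (as root), soon after the reboot:
#   mkdir -p /var/crash/`uname -n`      # example directory, an assumption
#   savecore -v /var/crash/`uname -n`
#   list_corefiles /var/crash/`uname -n`
```

If nothing is listed, the dump was probably not recovered, and that is
worth mentioning when contacting Sun Customer Services.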

Product Area: Kernel
Product: crash
OS: Solaris 2.x
Hardware: any

Sun Proprietary/Confidential: Internal Use Only