SunSolve Internal

Infodoc ID   Synopsis   Date
14171   kernel tips: Causes of asynchronous memory fault panics   8 Jan 1997

Description

What causes an "asynchronous memory fault" panic?



There are two main causes of asynchronous memory fault panics.


1) The CPU cache did not flush properly to main memory.

The CPU can modify lines in its cache, for example when a program
changes cached data.  This data must be written out to main memory at
some point if it is to be accessed by other processors or stored on
disk.  The write is asynchronous with respect to the part of the CPU
that uses the data (the part that performs calculations, etc.).  It
takes place from an on-chip write buffer, to which cache lines are
queued; writes from this buffer out to main memory are completed by a
different part of the chip.  The "asynchronous memory fault" occurs
when the asynchronous write from the cache to main memory terminates
with an error.

(Note that the actual write always takes place from the on-chip write
buffer, regardless of whether the MMU is in write-through or copy-back
mode and regardless of whether the data is marked non-cacheable.)

The error can be caused by any hardware along the path between the
cache and main memory, including the CPU module, the motherboard, or
the memory itself.  To narrow down the bad hardware, look for other
clues.  Check the /var/adm/messages files and dmesg output for other
kinds of errors: ECC memory errors indicate memory problems, other CPU
errors indicate a bad CPU module, MBus timeout errors point to a
potentially bad motherboard, and so on.
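
As a rough aid for the log check described above, the following Python
sketch counts log lines that hint at each suspect component.  The
keyword patterns and the names HINTS, classify_lines and scan_file are
assumptions made for illustration; actual kernel message text varies by
platform and OS release, so the patterns should be adjusted to match
what the logs really contain.

```python
import re
from collections import Counter

# Hypothetical keyword patterns (assumptions, not real message formats).
# Each maps a suspect component to text that might appear in a log line.
HINTS = {
    "memory": re.compile(r"\becc\b|memory error", re.IGNORECASE),
    "cpu module": re.compile(r"\bcpu\d*\b.*\berror\b", re.IGNORECASE),
    "motherboard": re.compile(r"mbus.*timeout", re.IGNORECASE),
}


def classify_lines(lines):
    """Count the log lines that hint at each suspect hardware component."""
    counts = Counter()
    for line in lines:
        for component, pattern in HINTS.items():
            if pattern.search(line):
                counts[component] += 1
    return counts


def scan_file(path="/var/adm/messages"):
    """Apply classify_lines to a messages file (path is the usual default)."""
    with open(path, errors="replace") as f:
        return classify_lines(f)
```

A component with a high count is only a starting point for diagnosis;
a line may match more than one pattern, and an absence of matches does
not rule a component out.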


2) An external device attempted to read or write a bad memory address.

This could be a hardware problem, where the device was properly set up
but accessed a bad address or the memory itself is bad; or it could be
a software problem, where a device driver did not set up its device to
access the proper part of memory.  Such a memory fault is asynchronous
with respect to the CPU because the device performs DMA to memory
independently of the CPU.

To tell whether the cause is (1) or (2), observe the circumstances
under which the panic occurs.  Does it happen consistently while a
particular activity is going on?  Software problems tend to be
consistent, predictable, and reproducible, whereas hardware problems
tend to be more random.  Things to look for:

- Is there a third-party device that, when operated, triggers this
  panic condition?

- Are there DMA errors in the /var/adm/messages file to point the way
  to a suspect device?
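
The search for DMA errors that point to a suspect device can be
sketched in Python as a tally of DMA-related log lines per device.  The
function name dma_errors_by_device and the assumed line layout (a
driver instance name such as "xyz0" followed by a colon) are
hypothetical; real messages may name the device differently, in which
case the DEVICE pattern needs to be changed.

```python
import re
from collections import Counter

# Assumption: a DMA-related line mentions "DMA" as a separate word.
DMA = re.compile(r"\bdma\b", re.IGNORECASE)
# Assumption: devices appear as a lowercase driver name plus an
# instance number and a colon, for example "xyz0:".
DEVICE = re.compile(r"\b([a-z]+\d+):")


def dma_errors_by_device(lines):
    """Tally log lines that mention DMA, grouped by the device they name."""
    tally = Counter()
    for line in lines:
        if not DMA.search(line):
            continue
        match = DEVICE.search(line)
        # Lines with no recognizable device name are grouped as "unknown".
        tally[match.group(1) if match else "unknown"] += 1
    return tally
```

For example, dma_errors_by_device(open("/var/adm/messages")) returns a
Counter whose most_common() list puts the most frequently named device
first.  A device that dominates the tally is a candidate for closer
inspection, not proof of the cause.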

Product Area: Kernel
Product: crash
OS: any
Hardware: any

Sun Proprietary/Confidential: Internal Use Only