Within a system board, each of the four address buses (abus) is handled by one of the four CIC asics, which route the bus off-board as a Global Address Bus (GAB).
ECC or parity errors are also caught if they occur.
When this occurs during online OS (or OBP) operation, it will normally be detected by Edd polling scripts. hpost -D will be called to create a dumpfile of the hardware system state on the SSP, and then hpost will be called to reboot the affected domain. The dumpfile can be examined using redx to determine the cause of the error.
For a basic guide to understanding Arbstops and how to deal with them, see Starfire Arbstops and redx 101.
Errors are caught by explicit comparison of the read data with the expected value.
B
For Starfire POST's main memory test, errors are caught via the ECC trap mechanism in the hardware; the value at each location is not explicitly tested. When testing bbsram, DTAGs and registers there is explicit testing of the values.
The BSS pattern itself was developed to reduce the time necessary to test all the bits in a store for stuck-at and adjacency faults. If N is the width of a given location, then the traditional walking ones/zeros requires 2N patterns to be used. The BSS pattern (e.g. 00001111, 00110011, 01010101, 00000000, 11110000, 11001100, 10101010, 11111111) provides the same test coverage but uses only 2(1 + LOG2(N)) patterns, thus saving time.
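To make the pattern counts above concrete, the following Python sketch generates a BSS-style pattern set for a location of width N and compares its size with the 2N patterns of walking ones/zeros. It assumes N is a power of two and that the patterns follow the ordering of the 8-bit example; the function name bss_patterns is made up for this illustration and is not part of POST.

```python
def bss_patterns(width):
    """Return BSS-style test patterns for a location `width` bits wide.

    Assumption: `width` is a power of two, and the ordering follows the
    8-bit example in the text (block patterns, all zeros, then complements).
    """
    if width < 1 or width & (width - 1):
        raise ValueError("width must be a power of two")
    mask = (1 << width) - 1
    patterns = []
    block = width // 2
    while block >= 1:
        # Alternate blocks of `block` ones and `block` zeros, starting
        # with ones at the least significant end.
        value = 0
        for bit in range(width):
            if (bit // block) % 2 == 0:
                value |= 1 << bit
        patterns.append(value)
        block //= 2
    patterns.append(0)
    # The second half of the set is the bitwise complement of the first.
    return patterns + [p ^ mask for p in patterns]


if __name__ == "__main__":
    for n in (8, 32, 64):
        print(n, "bits:", len(bss_patterns(n)), "BSS patterns vs", 2 * n, "walking")
    print([format(p, "08b") for p in bss_patterns(8)])
```

For an 8-bit location this yields the eight patterns listed above, against sixteen for walking ones/zeros; the saving grows with the width of the location.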
The Starfire centerplane is logically divided into two separately powered halves for purposes of maintaining availability in the face of failure. Each half provides two GABs of address routing and arbitration, one 72-bit data path (or bus), and one Global Data Arbiter (GDARB) chip. Starfire can operate, with half its normal maximum interconnect bandwidth, with either half-centerplane powered down.
POST and redx designate a particular CIC as
board.gab, e.g., "CIC A.1", meaning the CIC for
GAB 1 on system board A (10 decimal).
The redx command to display CIC registers is
shcic.
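When scanning POST or redx output with a script, a board.gab designation such as "CIC A.1" can be split into its parts. The Python sketch below is only an illustration: it assumes the board field is a single hexadecimal digit (consistent with board A being 10 decimal) and that the field after the dot is a decimal number. The function name and the regular expression are hypothetical and do not come from POST or redx.

```python
import re

# Assumed shape: ASIC name, a space, one hex digit for the board, and an
# optional ".unit" suffix (the GAB number in the case of a CIC).
_DESIGNATOR = re.compile(r"^([A-Z]+) ([0-9A-Fa-f])(?:\.(\d+))?$")


def parse_designator(text):
    """Split e.g. 'CIC A.1' into (asic, board_number, unit)."""
    match = _DESIGNATOR.match(text.strip())
    if match is None:
        raise ValueError("unrecognized designator: %r" % text)
    asic, board, unit = match.groups()
    return asic, int(board, 16), None if unit is None else int(unit)


if __name__ == "__main__":
    print(parse_designator("CIC A.1"))
```

Here "CIC A.1" parses to the asic name CIC, board number 10, and GAB 1.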
VVVV              Version (4 bits) [Also called Rev]
PPPPPPPPPPPPPPPP  Part Number (16 bits)
MMMMMMMMMMM       Manufacturer's ID (11 bits)
1                 Constant 1 (1 bit)
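The field layout above can be unpacked with a few shifts and masks. The Python sketch below assumes the four fields are packed most-significant-first, in the order listed, into a single 32-bit word (the same arrangement as the IEEE 1149.1 device identification register). The function name and the sample value are invented for illustration and do not identify any real Starfire component.

```python
def decode_component_id(idcode):
    """Split a 32-bit component ID into its four fields.

    Assumed packing: version in bits 31:28, part number in bits 27:12,
    manufacturer's ID in bits 11:1, and the constant 1 in bit 0.
    """
    if not 0 <= idcode <= 0xFFFFFFFF:
        raise ValueError("component ID must fit in 32 bits")
    return {
        "version": (idcode >> 28) & 0xF,
        "part_number": (idcode >> 12) & 0xFFFF,
        "manufacturer_id": (idcode >> 1) & 0x7FF,
        "constant_one": idcode & 0x1,
    }


if __name__ == "__main__":
    # Arbitrary sample built from made-up field values.
    sample = (2 << 28) | (0x1234 << 12) | (0x03E << 1) | 1
    print(hex(sample), decode_component_id(sample))
```

A value whose low bit is not 1 does not match this layout, so constant_one is a cheap sanity check on whatever was read back.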
This is organized so that each DIMM provides the same 18 bits in each cycle of the UPA transaction.
POST always runs in a single domain, whose name is normally obtained from the SSP environment variable SUNW_HOSTNAME. Using this name, POST queries the SNMP database for information about the system boards in the domain.
Domain clusters will usually contain only a single domain, but will contain more than one if the domains have been configured to share memory.
In most cases a transgression error would indicate a hardware error, since the data interconnect will not attempt to move data unless an address transaction indicates this should occur, and out-of-domain addresses are filtered by the domain logic in the address arbitration. However, interrupt address transactions sent to out-of-domain boards will also be ignored, and the lack of a NACK causes the data subsystem to try to transfer the mondo vector, so software can cause this error by sending interrupts to target processors outside its domain.
When enabled, transgression error causes an arbstop in the source domain. However, it has been discovered that certain failures on one system board, such as a power supply failure, can corrupt the LDARB -> GDARB interface so as to cause both a transgression error and a parity error, which causes GDARB to request a global arbstop. To avoid this, transgression error reporting is disabled when configuring GDARB. Transgressing requests will still be inhibited, so they won't affect other domains, but they will not be reported. Real hardware transgression errors will probably result in more obscure errors within the source domain, such as timeouts or queue underflows. Bogus interrupts will simply be discarded, with no error.
Reporting of transgression error by GDARB can be enabled with the .postrc command dom_transgress_err_enbl, if this is used when POST is run with -C to configure the centerplane. This may be useful in debug environments where more obscure errors are detected and transgression errors are suspected.
The DTAGs can be a source of confusion when they are detected to have failed. Because the DTAGs for each processor and IOC are distinct hardware resources, POST will often only mark the affected processor or IOC as FAILed, allowing the other resources on the system board to continue operation. However, it is the system board that must be replaced to fix the problem, not the processor or IO module.
Second, the global address bus (GAB) between the CIC asics and between the CIC and MC asics is protected by eight check bits on each 38 bits of information, allowing correction of single-bit errors and detection of all double-bit errors on these address connections.
See also recordstop.
POST and redx designate a particular GAARB as
gab, e.g., "GAARB 1", meaning the GAARB for GAB 1.
The redx command to display GAARB registers is
shgaa.
POST and redx designate a particular GAMUX as
gab.slice, e.g., "GAMUX 1.0", meaning the GAMUX
that handles slice 0 (bits [11:0]) for gab 1.
The redx command to display GAMUX registers is
shgam.
POST and redx designate a particular GDARB as
dbus, e.g., "GDARB 1", meaning the GDARB for dbus 1.
The redx command to display GDARB registers is
shgda.
GDARB can detect three types of errors on the request bus from the LDARB asic on each system board:
Physical addresses are interleaved across the address buses by the initiating PC, depending on the address bus configuration (See abus).
The MC asic may be configured to interleave the banks of memory it controls based on the address bus and physical address, depending on the number of active address buses and memory banks. In many cases, however, this will not provide any further interleave. For example, in the normal case there are four address buses. If there are four banks of memory, then each is connected directly to an abus. If there are two banks of memory, then the address buses are actually de-interleaved as two buses per bank.
Lastly, two boards with the same memory configuration can be interleaved based on a Physical Address bit. This is disabled by default because it can reduce the ability to use Dynamic Reconfiguration detach. It can be enabled from the .postrc file.
The JPR also supports some JTAG hardware peek and poke operations to memory and UPA ports. To support the memory functions, the JPR can also be accessed by JTAG directly to the XDB, but this is only possible when no system activity is in process.
To achieve this, SSP core software provides a three-level hierarchical locking facility that allows an application to reserve exclusive access to an individual JTAG ring, an entire board's JTAG resources, or all the JTAG in the entire cabinet.
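The idea of a three-level reservation (ring, board, cabinet) can be modeled in a few lines. The Python sketch below is a toy in-memory model only: the real SSP locking interface is not described here, so the class, method names, and conflict rule are all hypothetical. It assumes that a reservation conflicts with any held reservation that contains it or is contained by it.

```python
class JtagReservations:
    """Toy model of cabinet / board / ring reservations (hypothetical API)."""

    def __init__(self):
        self._held = {}  # scope tuple -> owner name

    @staticmethod
    def _scope(board, ring):
        # () is the whole cabinet, (board,) one board, (board, ring) one ring.
        if ring is not None and board is None:
            raise ValueError("a ring reservation needs a board")
        return tuple(x for x in (board, ring) if x is not None)

    @staticmethod
    def _conflicts(a, b):
        # Two scopes conflict when one is a prefix of the other, i.e. when
        # one contains the other in the cabinet/board/ring hierarchy.
        n = min(len(a), len(b))
        return a[:n] == b[:n]

    def reserve(self, owner, board=None, ring=None):
        """Grant the reservation and return True, or return False on conflict."""
        scope = self._scope(board, ring)
        if any(self._conflicts(scope, held) for held in self._held):
            return False
        self._held[scope] = owner
        return True

    def release(self, owner, board=None, ring=None):
        scope = self._scope(board, ring)
        if self._held.get(scope) != owner:
            raise KeyError("reservation not held by %s" % owner)
        del self._held[scope]


if __name__ == "__main__":
    locks = JtagReservations()
    print(locks.reserve("app1", board=3, ring=1))  # one ring on board 3
    print(locks.reserve("app2", board=3))          # whole board 3: refused
    print(locks.reserve("app2", board=4))          # a different board
```

In this model two applications can hold different rings on the same board at once, while a board-level or cabinet-level request is refused until every reservation inside it has been released.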
K
L
POST and redx designate a particular LAARB as
board, e.g., "LAARB A", meaning the LAARB
on system board A (10 decimal).
The redx command to display LAARB registers is
shlaa.
POST and redx designate a particular LDARB as
board, e.g., "LDARB A", meaning the LDARB
on system board A (10 decimal).
The redx command to display LDARB registers is
shlda.
The buses from XDBs to LDMUX are called ldat_x0 - ldat_x3.
The buses from LDMUX to XDBs are called ldat_r0 - ldat_r3.
POST and redx designate a particular LDMUX as board.slice, e.g., "LDMUX A.2", meaning the LDMUX that handles slice 2 (bits [35:0] of dbus 1) on system board A (10 decimal).
The redx command to display LDMUX registers is
shldm.
POST and redx designate a particular MC as
board, e.g., "MC A", meaning the MC on system board
A (10 decimal).
The redx command to display MC registers is
shmc.
However, certain operations performed by Dynamic Reconfiguration and POST will exchange the assignments of memory addresses on physical boards. This mapping of PA ranges to physical boards is maintained in an SNMP element called the MemAddrMap.
For Starfire POST's main memory test, errors are caught via the ECC trap mechanism in the hardware; the value at each location is not explicitly tested. When testing other resources, there is explicit testing of the values.
POST and redx designate a particular PC as
board.asic, e.g., "PC A.1", meaning PC asic number 1
on system board A (10 decimal).
POST and redx designate a particular PUP as board.asic, e.g., "PUP A.1", meaning the PUP numbered 1 on the memory module on board A (10 decimal).
For a basic guide to understanding recordstops and how to deal with them, see Using redx to Debug a Data Recordstop.
POST attempts to maintain a valid sigblock on all processors it is testing, for identification by Edd and hostview. However, there are times during POST execution, such as before it has downloaded any host executables, while it is testing bbsram, and during download of additional host executables, when a valid sigblock will not be present.
Other than providing the POST signature for detection by Edd, and setting the post2obp structure pointer in the boot processor at the end of a successful configuration, POST does not use the sigblock. It uses its own private methods for communication between the host and SSP resident programs.
The redx parse command has options for calculating and interpreting the syndromes used in Starfire data and address paths.
Translation between virtual and physical addresses is through
page tables in memory, which may be cached local to a processor.
The address translation is done by the processor's MMU, usually
transparently to the software.
POST and redx designate a particular XBAR as
dbus.slice, e.g., "XBAR 1.2", meaning the XBAR
that handles slice 2 (bits [35:24]) for dbus 1.
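The XBAR example above (slice 2 covering bits [35:24]) and the GAMUX example (slice 0 covering bits [11:0]) are both consistent with uniform 12-bit slices. Under that assumption, which is inferred from those two examples and not stated as a rule, the bit range for a slice number can be worked out as in this Python sketch. The function name is hypothetical, and the 12-bit default does not apply to asics with a different slice width.

```python
def slice_bit_range(slice_index, slice_width=12):
    """Return (high_bit, low_bit) for a slice, assuming equal-width slices."""
    if slice_index < 0 or slice_width < 1:
        raise ValueError("slice index must be >= 0 and width >= 1")
    low = slice_index * slice_width
    return low + slice_width - 1, low


if __name__ == "__main__":
    for index in range(3):
        high, low = slice_bit_range(index)
        print("slice %d: bits [%d:%d]" % (index, high, low))
```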
POST and redx designate a particular XDB as
board.asic, e.g., "XDB A.1", meaning XDB asic number 1
on system board A (10 decimal).
N
netcon
nibble
NMB
NPB
O
OBP
OS
P
PA
PC
The redx command to display PC registers is
shpc.
PCB
PCI
PCS
phase
physical address
pid
PLL
POST
.postrc
post2obp structure
processor module
psi bus
PSYCHO+
PUP
PUP is a mode of the
XMUX asic.
The redx command to display PUP registers is
shpup.
Q
R
RAS
recordstop
redlist
redx
refresh
Rn
S
SBus
scard
shared memory domain.
signature
signature block
SIMM
SMD
SNMP
snmpd
sram
SSP
Spitfire
subtest
syndrome
SYSIO
system board
T
tag
test
transgression error
U
UDB
UE
UltraSPARC
Ultra Enterprise 10000
UPA
UPA Port
V
virtual address
W
wfail
X
XARB
XBAR
The redx command to display XBAR registers is
shxbar.
XDB
The redx command to display XDB registers is
shxdb.
Xfire
XMUX
XOR
Y
Z
0-9
5/A Test
Maintained by:
Dan Drogichen (drog@marvin.west.sun.com)