SunSolve Internal

 

Infodoc ID   Synopsis   Date
14525   Common causes of SSA OFFLINE problems   20 Feb 1997

Description
Common CAUSES of the OFFLINE problems:

Listed in order, starting with the most likely cause of the majority
of Offline/Online messages:

	Fiber channel hardware.
	   -The FC/OM (optical module) part number needs to end in '-03'
		(or higher).  The lower revisions had problems.
		Check both ends of the cable.
	   -Fiber cable could be dirty, or have a loose connection.  
		(Be sure to always use the end caps to protect the cable if
		it is being moved.)
 	   -Fiber cable being bent beyond what it should be, or being 
		broken by someone standing or stepping on it.
	 
	The Array Controller card.
	   -Firmware version is a big issue here.  If running a current 
		version of firmware then there is a possibility of a faulty 
		board.

	Software drivers and/or firmware

	System IO load balance.
	   -We have had a couple of instances where the system was running too 
		many memory intensive applications, causing the SSA to have 
		to 'wait' for CPU time.  This can be 'fine-tuned' by
 		distributing the IO loading in larger systems, or maybe 
		even adding enough memory for all the applications to run 
		(without stepping on each other).  [For assistance in this
		area, please contact your local sales representative.]

How does one know which of the above is the source of the problem?
How can you tell whether these messages are coming from software,
firmware, or the Fiber channel hardware?

Messages.

In order to be able to troubleshoot the problem further, you must look
at the messages.  What is preceding the first OFFLINE, or subsequent
OFFLINE messages?  This will tell you where the most likely source
of the problem really is.

First, what device is reporting the error?  Is it the soc, or the pln?
This will begin to point to the source of the problem.  The soc
is the Fiber channel handler, and the pln is the driver software that
talks to the SSA.  If you see messages relating to 'soc', there is a
good chance the problem is in the fiber channel hardware.  If there
are 'pln' messages, then it is more likely not the fiber channel, but
elsewhere.  Based on the actual messages, you should be able to find
the source of the problem.
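
As a rough illustration of the soc/pln check described above, here is a
minimal Python sketch.  It is not a Sun tool; the function name
reporting_device is made up for this example, and it assumes each
message has already been joined back onto one line (the examples in
this document are wrapped for display).

```python
import re

# Driver names taken from the text and the example messages:
# soc (Fiber channel handler), pln (SSA driver), ssd (disk driver).
# Matches either "(SUNW,pln0)" / "(ssd145)" or a "soc0:" style prefix.
_INSTANCE = re.compile(r"\((?:SUNW,)?(soc|pln|ssd)\d+\)|\b(soc|pln|ssd)\d+:")

def reporting_device(line):
    """Return 'soc', 'pln', 'ssd', or None for one unwrapped message line."""
    m = _INSTANCE.search(line)
    if m is None:
        return None
    return m.group(1) or m.group(2)
```

A line that only carries the message text (for example the 'Timeout
recovery being invoked...' line) returns None; the device is named on
the WARNING line that precedes it.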

Below is a short 'table' of the most likely sources of 'offline/online'
messages.  They are ordered top (most likely) to bottom (least likely)
per category.  [NOTE: This is a guide based on what we know at this time;
this should not imply these are the only possibilities.]

--------------------------------------------------------------------------
--------------------------------------------------------------------------

Messages information:                   Suspect source of problem:
                                        (ordered most likely to least likely)
------------------------                ---------------------------
Offline/Online - "plain"                Hardware:
(without any other associated               Fiber cable
message indications)                        soc**
                                            SSA controller

Timeout Recovery -
   Timeout recovery being invoked       Usually software (about 75% of
                                        the time);
                                            SSA firmware
                                            SSA driver
                                            SSA hardware

Transport Error -                       For all of these:
   Transport error:  incomplete             Bad disk drives
                     reset                Software:
                     timeout                ssd
                     data_ovr               ssa/pln driver
                                            SSA firmware
                                            kernel

Transport Rejected -                    Hardware:
                                            Fiber cable
                                            or soc**

Media errors -                          Disk drives

If any messages with SYS_NOTICE -       SSA firmware

        [ NOTE  ** denotes FC/OMs also as a possibility ]

---------------------------------------------------------------------------
---------------------------------------------------------------------------
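
For anyone scripting a first pass over a messages file, the table above
can be written down as a simple lookup.  The following is only a sketch
in Python: the names SUSPECTS, PATTERNS and categorize are made up for
this example, and the match strings for media errors and SYS_NOTICE are
assumptions, since no sample of those messages appears in this
document.  It looks at one line at a time, so it cannot tell a "plain"
offline/online from one that has other messages around it.

```python
# Suspects per message category, most likely first, transcribed from the table.
SUSPECTS = {
    "offline/online": ["fiber cable", "soc or FC/OM", "SSA controller"],
    "timeout recovery": ["SSA firmware", "SSA driver", "SSA hardware"],
    "transport error": ["bad disk drives", "ssd", "ssa/pln driver",
                        "SSA firmware", "kernel"],
    "transport rejected": ["fiber cable", "soc or FC/OM"],
    "media error": ["disk drives"],
    "sys_notice": ["SSA firmware"],
}

# Substrings are checked in this order, compared case-insensitively.
PATTERNS = [
    ("sys_notice", "sys_notice"),        # assumed spelling in the log
    ("timeout recovery", "timeout recovery"),
    ("transport rejected", "transport rejected"),
    ("transport error", "transport error"),
    ("media error", "media error"),      # assumed spelling in the log
    ("is offline", "offline/online"),
    ("is online", "offline/online"),
]

def categorize(line):
    """Return the table category for one message line, or None."""
    lower = line.lower()
    for needle, category in PATTERNS:
        if needle in lower:
            return category
    return None
```

SUSPECTS[categorize(line)] then gives the ordered suspect list for a
line that matched.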

One of the most frustrating messages is the 'Timeout Recovery' message.
This one in particular needs to be examined a bit more closely.  Check
for any other messages, such as disk related errors.  If there are any,
use those to determine the most likely cause of the problem.

Here is one example of what these may look like:

Jan 31 16:17:21 unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@0,0/
SUNW,pln@a0000000,78ad31 (SUNW,pln0):
Jan 31 16:17:21 unix:  Timeout recovery being invoked...
Jan 31 16:17:21 unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel
 is OFFLINE
Jan 31 16:17:22 unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel
 is ONLINE
Jan 31 16:17:22 unix: ID[SUNWssa.soc.login.6010] soc0: Fibre Channel login 
succeeded
Jan 31 16:17:22 unix: ID[SUNWssa.soc.link.1010] soc0: message:  SSA110 V3.9 

When we get Timeout Recovery messages, it means that there was a timeout
on a command sent to the SSA.  In the example above, this is the only 
message; there are no other messages from the SSA or disks, only this
'Timeout Recovery' message followed by the recovery procedure messages
(the subsequent offline and online/login messages).

The software will attempt a recovery by flushing out the transport and
reconnecting.  This is what causes the following offline/online sequence.
It is trying to do the operation again.

If a hardware problem on the controller allows the link to become
established (online and login succeed) while the commands all time
out, the driver will go through the timeout and offline/online
recovery over and over again.  If this is all there is in the
messages file, and all from one SSA, then the most likely suspect would
be the controller board itself.  You might also see this with one of
the isp chips (SCSI controller chips) on the board, but you will then
also have other messages relating to those addresses.
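
The "nothing but timeout recovery, all from one SSA" pattern can be
checked mechanically.  The Python sketch below is an illustration only:
the function name only_timeout_recovery and the min_repeats threshold
are made up for this example, the message strings are copied from the
example above, and each message is assumed to be on a single unwrapped
line.

```python
import re

# Lines that belong to the recovery sequence itself.
RECOVERY_TOKENS = (
    "Timeout recovery being invoked",
    "Fibre Channel is OFFLINE",
    "Fibre Channel is ONLINE",
    "Fibre Channel login succeeded",
    "soc.link.1010",  # firmware banner, e.g. "message:  SSA110 V3.9"
)
# WARNING header that names the pln instance, e.g. "(SUNW,pln0)".
PLN_HEADER = re.compile(r"WARNING:.*\(SUNW,(pln\d+)\)")

def only_timeout_recovery(lines, min_repeats=2):
    """True if the lines hold only repeated timeout recoveries on one pln."""
    timeouts = 0
    instances = set()
    for line in lines:
        if not line.strip():
            continue
        header = PLN_HEADER.search(line)
        if header:
            instances.add(header.group(1))
            continue
        if not any(token in line for token in RECOVERY_TOKENS):
            return False  # some other message is present; look at it first
        if "Timeout recovery being invoked" in line:
            timeouts += 1
    return timeouts >= min_repeats and len(instances) == 1
```

A True result only says the messages fit the pattern described above;
the firmware level still has to be ruled out before suspecting the
controller board.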

Of course, a combination of software and hardware may still be the
cause of problems.  The best you can do is to get the software to the
most current levels (including disk firmware levels); once that is
done, most remaining problems are likely to be hardware related.
Basically, try to rule out one or the other based on versions and
messages.
	
EXAMPLES:
Here are some examples of what you might see in a messages file 
relating to offline/online sequences.  See if you can figure out the
source of the problems.

[ I have stripped out the date and system name for space savings. ]
       ----------------------------------------------------
#1)
 07:25:08  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05 (SUNW,pln2):  
 07:25:08  unix:  Timeout recovery being invoked...  
 07:25:08  unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre Cha
nnel is OFFLINE 
 07:25:09  unix: ID[SUNWssa.soc.link.6010] soc1: port 0: Fibre Cha
nnel is ONLINE 
 07:25:09  unix: ID[SUNWssa.soc.login.6010] soc1: Fibre Channel lo
gin succeeded 
 07:25:09  unix: ID[SUNWssa.soc.link.1010] soc1: message:  SSA100 
V3.6 (031896) Mon Mar 18 19:57:51 1996   
 07:29:28  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05 (SUNW,pln2):  
 07:29:28  unix:  Timeout recovery being invoked...  
 07:29:28  unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre Cha
nnel is OFFLINE 
 07:29:29  unix: ID[SUNWssa.soc.link.6010] soc1: port 0: Fibre Cha
nnel is ONLINE 
 07:29:29  unix: ID[SUNWssa.soc.login.6010] soc1: Fibre Channel lo
gin succeeded 
 07:29:29  unix: ID[SUNWssa.soc.link.1010] soc1: message:  SSA100 
V3.6 (031896) Mon Mar 18 19:57:51 1996   
 07:30:08  unix: ID[SUNWssa.soc.link.5010] soc1: port 0: Fibre Cha
nnel is OFFLINE 
 07:31:08  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05/ssd@3,1 (ssd145):  
 07:31:08  unix:  Transport error:  Fibre Channel 
 07:31:08  unix: Offline
 07:31:08  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05/ssd@3,1 (ssd145):  
 07:31:08  unix:  requeue of command fails (ffffff
 07:31:08  unix: fe)  
 07:31:09  unix: NOTICE: vxvm:vxio: Disk c3t3d1s2: Unexpected stat
us on close: 0
 07:31:09  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05/ssd@3,1 (ssd145):  
 07:31:09  unix:  transport rejected (-2)  
 07:31:09  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05/ssd@3,1 (ssd145):  
 07:31:09  unix:  transport rejected (-2)  
 07:31:09  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05/ssd@3,1 (ssd145):  
 07:31:09  unix:  transport rejected (-2)  
 07:31:09  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05/ssd@3,2 (ssd146):  
 07:31:09  unix:  transport rejected (-2)  
 07:31:09  unix: NOTICE: vxvm:vxio: Disk c3t3d2s2: Unexpected stat
us on close: 0
 07:31:09  unix: WARNING: /io-unit@f,e1200000/sbi@0,0/SUNW,soc@2,0
/SUNW,pln@a0000000,740f05/ssd@3,3 (ssd147):  
 07:31:09  unix:  transport rejected (-2)  
       ----------------------------------------------------
For #1, if you decided that the fiber cable is the most likely
suspect, you win the prize!
       ----------------------------------------------------
#2)
 02:07:29  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e (SUNW,pln1):  
 02:07:29  unix:  Timeout recovery being invoked...  
 02:07:37  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e (SUNW,pln1):  
 02:07:37  unix:  Timeout recovery failed, resetting  
 02:07:37  unix: ID[SUNWssa.soc.driver.1010] soc0: host adapter fw date code:
Wed Jan 17 20:34:59 1996  
 02:07:37  unix:  
 02:07:37  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:37  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:37  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:37  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:38  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:38  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:38  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:38  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:39  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:39  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:39  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:39  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:40  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:40  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:40  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:40  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:41  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:41  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:07:41  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:07:42  unix: ID[SUNWssa.soc.login.6010] soc0: Fibre Channel login succeeded

 02:07:42  unix: ID[SUNWssa.soc.link.1010] soc0: message:  SSA110 V3.6 (031896)
Mon Mar 18 19:57:51 1996  
 02:09:28  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e (SUNW,pln1):  
 02:09:28  unix:  Timeout recovery being invoked...  
 02:09:28  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:09:29  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:09:29  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:09:29  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:09:29  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:09:30  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:09:30  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:09:30  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:09:30  unix: ID[SUNWssa.soc.link.5010] soc0: port 0: Fibre Channel is
OFFLINE 
 02:09:31  unix: ID[SUNWssa.soc.link.6010] soc0: port 0: Fibre Channel is
ONLINE 
 02:10:32  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@5,1 (ssd41):  
 02:10:32  unix:  Error for command 'write(10)' Err
 02:10:32  unix: or Level: Retryable
 02:10:32  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@0,0 (ssd5):  
 02:10:32  unix:  Transport error:  Fibre Channel Of
 02:10:32  unix:  Requested Block 1705952, Error Block: 1705952 
 02:10:32  unix: fline
 02:10:33  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@0,0 (ssd5):  
 02:10:33  unix:  Transport error:  Fibre Channel Of
 02:10:33  unix:  Sense Key: Hardware Error 
 02:10:33  unix: fline
 02:10:33  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@0,0 (ssd5):  
 02:10:33  unix:  Transport error:  Fibre Channel Of
 02:10:33  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@5,1 (ssd41):  
 02:10:33  unix:  requeue of command fails (fffffff
 02:10:33  unix: fline
 02:10:33  unix: e)  
 02:10:33  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@0,0 (ssd5):  
 02:10:33  unix:  Transport error:  Fibre Channel Of
 02:10:33  unix: fline
 02:10:33  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@0,0 (ssd5):  
 02:10:33  unix:  Transport error:  Fibre Channel Of
 02:10:33  unix: fline
 02:10:33  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@0,1 (ssd6):  
 02:10:33  unix:  Error for command 'write(10)' Erro
 02:10:33  unix: WARNING: /io-unit@f,e0200000/sbi@0,0/SUNW,soc@2,0/SUNW,
pln@a0000000,78be6e/ssd@0,0 (ssd5):  
 02:10:33  unix:  Transport error:  Fibre Channel Of
 02:10:33  unix: r Level: Retryable
 02:10:33  unix: fline
 02:10:33  unix:  Requested Block 570688, Error Block: 570688 
       ----------------------------------------------------
Now, #2 is a bit more difficult, and not unlike what might be seen 
on a system.  If we analyze this, based on the information given
about offline/online messages, we come up with more than one possible
source of the problem.   Let's take a look together.

We begin with 'timeout recovery being invoked', which is usually caused
by a software problem; either firmware or the ssa/pln driver.
Next, we see a whole bunch of "plain" offline/online messages, which is
normally a fiber cable problem; it might not be seated well.
Finally we end up with disk errors; one is a retryable write, the rest
are just reporting 'transport error' fiber offline with a hardware
error sense key.
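
To put a number on "a whole bunch" of offline/online messages, the
OFFLINE transitions can be tallied per soc port.  This is a minimal
Python sketch; the function name count_offline is made up for this
example, the message format is taken from the examples above, and each
message is assumed to be on one unwrapped line.

```python
import re
from collections import Counter

# Matches e.g. "soc0: port 0: Fibre Channel is OFFLINE"
LINK = re.compile(r"(soc\d+): port (\d+): Fibre Channel is (OFFLINE|ONLINE)")

def count_offline(lines):
    """Count OFFLINE transitions keyed by (soc instance, port number)."""
    counts = Counter()
    for line in lines:
        m = LINK.search(line)
        if m and m.group(3) == "OFFLINE":
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts
```

A port that goes offline many times within a few seconds, as in
example #2, stands out immediately in the result.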

So, in this one example we now have three possibilities: bad or loose
fiber cable to the ssa; bad firmware running in the ssa; bad disk(s).
[This messages file continued the same patterns of message outputs 
over and over, listing almost each disk device in the array.]

Based on the evidence here, I would first try the fiber cable, because
it is the easiest, and because if the cable connection is not good,
the communications will not be correct.  If this did not clean up the
problem, I would then try a firmware download.  Normally doing these
two to a system like this one should eliminate most of the extraneous
messages.  Then all that would be left, most likely, might be one
or two bad disk devices.

If the messages still persist after changing out the fiber cable, the
firmware, and maybe a disk or two, the remaining suspects are the balance
of the hardware for the fiber channel connection, the SSA driver software
package, and the kernel.

I wanted to use this example to show that it can be a combination of
things, but normally they are inter-related.

       ----------------------------------------------------
Product Area System Administration
Product Disk admin
OS any
Hardware any


Sun Proprietary/Confidential: Internal Use Only