SunSolve Internal

Infodoc ID   Synopsis   Date
20151   UNIX Kernel Stack Overflows   25 Aug 1999

Description

UNIX Kernel Stack Overflows
===========================

Version: 1.3
Last update: 99/08/10

Disclaimer: This document is considered unofficial.  It is issued
	    as an infodoc to clarify and concentrate information
	    on kernel stack overflows for customers and Sun field.

How to use this document
------------------------

Kernel stack overflows are a complex topic, so this document
includes a fair amount of discussion.  You should read and understand
it all before making any modifications to your system - careless
tuning could render your system unusable.  For the impatient,
section 4 contains all the suggestions for avoiding stack
overflows - but be sure you understand the side-effects of
any parameters you change.  If in doubt, read the earlier sections!

1.1 Introduction
1.2 Terminology
2.1 The Difficulty of Dynamically Growing Kernel Stacks
2.2 Stack Usage Rules For Kernel Developers
3.1 Kernel Stacks In Solaris
	3.1.1 Stack Sizes for Different Kinds of Threads
3.2 What Does a Kernel Stack Overflow Look Like?
3.3 Past Causes of Stack Overflow in Solaris
	3.3.1 M64 Graphics Driver ati
	3.3.2 setcontext()
	3.3.3 RPC svc_run
	3.3.4 Pagefault Handling
	3.3.5 Recursion in prrealvp
3.4 Mapping the Red Zone
4.1 Preventing Kernel Stack Overflows in Solaris
	4.1.1 Kernel Parameters Available for Tuning
	4.1.2 Consequences of Tuning Stack Sizes
	4.1.3 Recommendations

1.1 Introduction
----------------

This infodoc aims to describe kernel stacks in sufficient detail to
discuss the notion of kernel stack overflow.  The difficulty of
kernel stack growth is not unique to Solaris or to the SPARC
architecture, so the initial description will be generic.
In the final section we will cover Solaris in particular,
and discuss techniques for minimising the chances of kernel
stack overflow in Solaris.

Stack overflows are not at all common - the number of distinct
cases is small and most of these are addressable in patches.
For large and busy systems that have some combination of
layered drivers (VxFS, VxVM, DiskSuite) and disk subsystems
that require more elaborate drivers (rm6, fca etc) it may be
worth taking some precautions to avoid unnecessary panics.

1.2 Terminology
---------------

To make all the distinctions we require, we'll find a bit
of terminology necessary.  These definitions won't be complete in
all detail, but they'll be good enough to discuss kernel stacks in
as much depth as we need to.

User Process:
	A user process is a running instance of some user program
	(which must have been resident in the filesystem somewhere,
	such as /bin/ls or /usr/dt/bin/dtmail).

Address Space:
	Each user process has its own unique address space so that one user
	process cannot directly violate the memory space of another.  The
	valid ranges of virtual addresses within the address spaces
	of two distinct processes will often look similar (eg, executable
	code normally starts at address 0x10000), but will be mapped
	by the kernel to different physical memory addresses.

Segment:
	An address space is made up of a number of segments.  Typical
	segment types for a user process are text (executable code),
	heap (global program data), shared library (mapped in at process
	creation time) and stack (used to store the process's main stack).
	You can see the segments comprising the address space for any process
	using /usr/proc/bin/pmap.

32-bit Program:
	In a 32-bit program virtual memory addresses are formulated
	using 32-bits, so the addressable range is 0 to 2^32 - 1 - a range
	of 4GB.  This means that a single 32-bit process can address
	up to 4GB of virtual memory.  There was a time when 4GB was a lot
	of address space, but increasingly programs are finding the
	need to address more than this.

64-bit Program:
	Similarly, a 64-bit program formulates virtual memory addresses
	in 64-bits and has an address range of 0 to 2^64 - 1.  This is
	an absolutely vast range of addresses - we can map very large
	individual segments into a 64-bit address space and we can even
	leave very large "holes" between the segments.

32-bit Kernel:
	A 32-bit kernel is a kernel that uses 32-bit addresses.  Being
	a 32-bit program, a 32-bit kernel can use at most 4GB to store
	all its own executable code and data structures.  Since the kernel
	is responsible for all aspects of the system, it must maintain enormous
	numbers of data structures (eg, a structure to keep track of
	every process created in the system, a structure to manage every
	physical page of memory).  As systems become bigger and more complex,
	the 4GB that a 32-bit kernel has available in which to store its
	data structures has become ever more crowded.

64-bit Kernel:
	A 64-bit kernel, on the other hand, uses 64-bit addresses and
	can therefore address a vast amount of memory for storing its
	own data structures.

User Thread:
	Originally a user process had just a single thread of control.
	Execution started at the main() function and traced subsequent
	code and function calls in a single path.  To perform tasks in
	parallel, a process would fork() a child to perform some work.

	More recently user programs have become multithreaded - multiple
	threads of control.  Execution still starts at main() but the
	process can create additional threads of control through calls
	to an API.  The resulting "user level threads" can perform
	tasks in parallel, and even run simultaneously in a multi-CPU
	system.

	A user process is, therefore, comprised of a number (perhaps only 1
	in the single-threaded case) of user threads.  These user threads
	all share the address space of the process within which they reside.
	Multithreaded applications usually employ from two to several
	tens of threads, but some applications are also written to
	use hundreds or thousands of threads.

Kernel Thread:
	Modern UNIX kernels are also multithreaded, meaning that we
	have multiple threads of control within the kernel.  A typical
	kernel will have created hundreds or even thousands of kernel
	threads.

	Some kernel threads exist only to support the system calls
	made by user-level threads.  When a user process (ie, some
	thread in that process) requires a service of the system
	(necessarily provided by the kernel) it performs a system
	call into the kernel, and one of the kernel threads created
	to support that process performs the requested service.

	The kernel threads that support a particular process also
	perform some transparent work on behalf of the process.
	For example, if a process accesses an address in a page that
	has been paged out to swap this will generate a page fault.
	Handling this page fault (page in from swap) requires the
	kernel's intervention, and the kernel thread supporting the
	user thread that "pagefaulted" will perform the necessary work -
	the user thread will later resume without even knowing that
	a pagefault took place.

	Some kernel threads are "pure" kernel threads.  These don't perform
	services directly for user processes, but instead perform
	background and housekeeping tasks.  Examples are pageout, fsflush
	and the kernel RPC threads.

Stack Frame:
	In C every function that is called has a corresponding
	stack frame (except for so-called leaf functions where the
	compiler can sometimes optimise the stack frame away).  This provides
	storage for the CPU registers in use in that function (mostly
	we'll work within the CPU registers themselves, but at times
	such as when switching to a new process to run or calling another
	function from within the current function we may need somewhere
	to store the current register values).  The stack frame also
	provides storage for the local variables of the function.

	Not all stack frames are the same size - they vary depending on
	the number and size of local variables in the function.  The
	minimum stack frame size (one which just provides storage for
	registers but no local variable storage) is usually less than
	100 bytes.

Stack:
	Abstractly, a "stack" is a linear list from which insertions and
	deletions are made from only one end.

	The stack for a given thread is a linear list of stack frames.
	As a function call is made a new stack frame is allocated and
	inserted at the bottom of the stack.  When the function call
	returns (perhaps after having made further function calls)
	its stack frame is removed from the bottom of the stack.

	Stack frames that are logically adjacent in the stack
	(ie, the stack frames for two functions one of which has called
	the other) are usually physically adjacent in virtual memory (the
	processor instructions that manipulate stack pointers simply
	increment or decrement the current stack pointer).  This means
	that if we have an area of memory, say one 8K page, allocated
	to hold a particular thread's stack and have the two adjacent
	pages (one above and one below this stack page) in use for
	other purposes, then we cannot easily grow the stack outside
	of the 8K page it started in.
	
User Stack:
	For a single-threaded process the stack resides within the
	stack segment of the process address space.  The stack
	segment usually starts out reasonably small (8K) and the
	initial stack frame (for main) is allocated at the top of this.

	If during process execution the stack grows to the extent that
	we will "drop off" the bottom of the stack segment (remember
	that consecutive stack frames are usually allocated in adjacent
	memory ranges) the kernel can catch this access and
	quickly increase the stack segment size (provided no ulimit
	has been exceeded).

	In order to allow for the possibility of stack segment growth,
	the virtual address range of the initial stack segment is chosen
	so as to have a virtual address space "hole" below it into
	which we can grow the stack segment.  If instead the initial
	stack segment were placed immediately adjacent to another
	address space segment we would not be able to grow the stack
	segment.

	For multithreaded user processes, each thread is allocated
	its own stack at the time it is created (the process starts
	with one thread and can create others from there).  Unlike
	the single-threaded case, we are unable to dynamically
	grow the stacks of these threads.  The reason is quite
	simple - leaving the necessary virtual address space "hole"
	below each allocated stack into which we could grow
	can soon exhaust the 32-bit address space (4GB) that
	a 32-bit process can access.

Kernel Stack:
	All the kernel threads within the kernel share the
	same address space (just like all user threads within
	a single multithreaded process share an address space).

	Each kernel thread is allocated its own stack at the
	time it is created.  It immediately becomes obvious that
	it will be difficult to space these stacks in a 32-bit
	address space in such a way that there is room for
	growth beneath every stack while still leaving much
	room for the kernel to store other material.

	"Pure" kernel threads always run on their allocated (kernel) stack.
	User threads run on their allocated stack until such time
	as they make a system call or until the kernel handles something
	like a pagefault on their behalf.  At this point we
	switch to running the kernel thread using its allocated
	stack.  When the system call or whatever is complete
	we return to running the user thread on its stack.

2.1 The Difficulty of Dynamically Growing Kernel Stacks
-------------------------------------------------------

As discussed above, all kernel threads share the same
address space ("kernel address space").  On
pre-sun4u architectures kernel address space is limited
to just a fraction of the 32-bit range - no more than 251MB
on sun4d for example.  In sun4u the kernel has a larger
maximum address space size - for 32-bit kernels
it can address up to 4GB and 64-bit kernels can address
the full 64-bit address range.

If we accept that stack frames will be adjacent in
virtual memory (as they usually are) then to grow
a stack we need to have virtual address space free
below the current stack into which we can grow.
This can be a waste of kernel VM space - this space
is often needed for other kernel structures.

An alternative is to create all kernel threads with a bigger
default stack size - if we can't dynamically grow the stack
then we can at least give it more room.  This solution is
adequate if we have plenty of VM space to spare - either
a 32-bit kernel that does not have high kernel memory
demand, or a 64-bit kernel that has masses of VM space.
Of course the penalty is that we'll use a little more
memory for every kernel thread.  Even if we increase the
size of kernel stacks, kernel developers are still obliged
to be conservative with their stack usage.

It is also difficult to detect *when* a stack needs to be grown.
Typically all stacks are allocated with a single unmapped page
below them acting as a "redzone".  If something tries to
access this redzone page it will generate a fault (because the
page has no mapping).  We could assume that this fault is
the thread trying to access memory beyond its current stack
and quickly map in a new page (this is not ideal - we've now
lost the redzone and the next page the stack grows towards
could be some other data) but this does not distinguish the
case where some rogue writer has targeted a random thread's
stack redzone.

Instead the solution has been to impose a rule upon kernel
programmers: "do not use too much kernel stack".

2.2 Stack Usage Rules For Kernel Developers
--------------------------------------------

There are a few rules that kernel programmers must adhere to
in order to avoid excessive stack usage:

	1) Avoid requiring excessive local variable storage.  For example,
	   allocating an array on the stack should be avoided unless the
	   array is small - declare the array to be static or kmem_alloc
	   memory for it (so that it resides on the kernel heap instead).

	2) Avoid excessively deep function call stacks.  Even if
	   each function is not excessively greedy they can all add
	   up to a substantial stack requirement.

	3) Recursion can only be used where you know it will terminate
	   in just a few iterations.

	4) Be aware that your code is not the only user of a thread's
	   kernel stack - fault handling (eg, pagefault) may occur
	   transparently to your code but may also use your stack.

	5) If mainline code needs to perform a subsidiary or housekeeping
	   task see if this task can be deferred - tasks within tasks
	   double up on stack usage.  See if you can use a separate
	   thread which can be signalled to perform housekeeping,
	   or use callouts or task queues (recently introduced in Solaris 7).

3.1 Kernel Stacks In Solaris
----------------------------

In 32-bit Solaris kernels (ie, releases 2.6 and earlier and Solaris 7 or
later when booted 32-bit) the normal/common kernel stack size is 8K.
On sun4u this is a single page - on earlier architectures such as sun4m
and sun4c this is two pages.  For 64-bit kernels the normal/common
stack size is 16K.

All the stacks for all the kernel threads that exist are allocated
from the 'segkp' kernel address space segment.  Stacks are usually
allocated with a "redzone" - a page of virtual address space below
the stack (ie, in the direction of stack growth) that does not
have a valid mapping to a physical page.  When a stack grows
to the extent that a write to the latest stack frame attempts
to access the redzone for this stack, a fault will result
(because the page has no mapping).

It's worth noting that even on sun4u the segkp segment mentioned above has a 
maximum size.  In releases 2.6 and earlier this was 512MB,
and from Solaris 7 (with kernel update 106541-04 or later) this is
increased to 24GB if running the 64-bit kernel.  This is important:
with segkp limited to 512MB on 32-bit kernels and with all
kernel stacks and their abutting redzones being allocated from
segkp, increasing the default stack size (as we'll see how
to do later) means we can create fewer kernel threads before
exhausting segkp.  (On sun4u, for example, an 8K stack plus its 8K
redzone consumes 16K of segkp virtual address space, so a 512MB segkp
can accommodate at most around 32,000 kernel stacks; 16K stacks reduce
that ceiling to roughly 21,000.)  This is a consideration on very large
configurations using 32-bit kernels.

Solaris kernel stacks are not dynamically growable.  If a stack overflows
(ie, we attempt to write some register to the current stack frame
and hit the redzone) then the system will panic (see next section for
details).

3.1.1 Stack Sizes for Different Kinds of Threads

When some part of the kernel needs to create a new kernel thread,
it can specify the stack size to allocate for the new thread or
it can accept the default.  Most callers accept the default,
and a minority request a larger stack because they know that
they're likely to need the extra room.

Default Stack Size:
	If no stack size is specified by the caller, then a default
	is used.  In releases 2.6 and earlier this default is 8K.
	For Solaris 7 and later booted 32-bit, the default is also 8K.
	For Solaris 7 and later booted 64-bit, the default is 16K.

	The default stack size is controlled by the kernel
	variable _defaultstksz.  Unfortunately, because of
	bug 4025675 (can't set variables in /etc/system if they
	begin with an underscore), this setting is not easily
	tuned (see section 4.1.3).

LWP Default Stack Size:
	A given user process will have one or more (if it is
	multithreaded) kernel threads that support it (as described
	in the terminology section).  These kernel threads are
	created with a stack size controlled by the kernel variable
	lwp_default_stksize.  The default value for this parameter
	is that of _defaultstksz, but lwp_default_stksize
	can be easily tuned from /etc/system (see section 4.1.3).
	These will be the stacks used during system calls and
	when handling faults (eg, pagefault) from user processes.

STREAMS Thread Stack Sizes:
	The STREAMS system creates a number of pure kernel threads
	(background, liberator, writer_thread).  These threads are all
	created with a stack size of 2 * PAGESIZE (16K for sun4u systems).
	This is not tuneable.

Kernel RPC Stack Sizes:
	Solaris supports in-kernel RPC (remote procedure call).
	The main service offered using in-kernel RPC is NFS.
	Threads created to handle incoming RPC requests are
	allocated a stack size controlled by the variable
	svc_run_stksize in the rpcmod module.
	
3.2 What Does a Kernel Stack Overflow Look Like?
------------------------------------------------

When a kernel thread suffers a stack overflow (write into redzone)
a fault will be generated (kernel trying to access a virtual memory
location for which there is no backing physical memory).  Because
we cannot dynamically grow kernel stacks, we cannot "make things right"
in the fault handler - instead we are obliged to panic.  In fact the
kernel doesn't even know that it is a stack overflow that we are suffering -
all it knows is that some thread has tried to access an illegal memory
location.  That's why the typical panic message is a little cryptic:

panic[cpu10]/thread=0x30a47e80: Kernel panic at trap level 2, trap reason 0x2
TL=0x1 TT=0x47 TICK=0x80000498cbcfdf2a
        TPC=0x10008e3c TnPC=0x10008e40 TSTATE=0x4480001e01
TL=0x2 TT=0x68 TICK=0x80000498cbcfdedf
        TPC=0x10006720 TnPC=0x10006724 TSTATE=0x4480001502

That's typically how a stack overflow on a sun4u system will appear -
while already handling a trap (running at trap level 0x1) we have tried
to write to the stack redzone which causes another trap.  The handler
for the second trap determines that this was illegal and panics the
system.

Less often we overflow the stack while at trap level 0x0 (not already
handling a trap).  In this case we get the less cryptic panic message
"segkp_fault: accessing redzone".

On earlier architectures (sun4c, sun4m, sun4d), which do not support
nested trap levels, we'd simply have seen a "Watchdog Reset" message.

Determining who has overflowed the stack and why is rather more
difficult.  One has to delve inside the panic crash dump and
reconstruct things ... all beyond the scope of this infodoc.

3.3 Past Causes of Stack Overflow in Solaris
--------------------------------------------

Some stack overflows have occurred because a kernel programmer
(in the core kernel, or in a device driver supplied with the
OS or with an unbundled or third-party product) has not adhered
to the rules of section 2.2.  Other overflows have occurred
because of multiple "layered drivers" - no one component violating the
rules of section 2.2 but the layers adding together to eventually
exhaust the thread's stack.  We'll take a look at some examples
before discussing techniques of avoiding these scenarios in the
next section.

3.3.1 M64 Graphics Driver ati

The call stack for this bug (4112097) was as follows:

	sys_tl1_panic
	ddi_copyin
	ati_ioctl
	spec_ioctl
	ioctl
	syscall_trap

This stack is not very many levels deep.  A user thread has made
an ioctl() system call for the ati device.  The fault here was that
the ati_ioctl() function requested an excessively large stack frame
because of a large local array - ati_ioctl used a stack frame
of 2496 bytes which is too large a share of the default 8K stack
for a single function to request.

Increasing lwp_default_stksize would have helped in this case - this
kernel thread supports a user thread and so its stack size was
determined by lwp_default_stksize.  But the correct fix (as per
patches 103792-08 and 105362-07) is to rearrange ati_ioctl's
local variable usage.  A bigger lwp_default_stksize would have
hidden the problem for longer, or perhaps altogether.

3.3.2 setcontext()

In this bug (4098645) the setcontext() function requested a large
stack frame, leaving less room for further stack frames.  In that
way it is similar to the M64 driver bug in 3.3.1.  However this bug
illustrates how layered drivers can increase the demands on stack space.
Here's a sample call stack (take a deep breath, read stacks from the bottom
towards the top):

unix:trap
sd:sddone
genunix:thread_unpin()
unix:swtch()
unix:mutex_adaptive_enter
genunix:rmfree
sbus:iommu_dma_unbindhdl
isp:isp_scsi_destroy_pkt
scsi:scsi_destroy_pkt
sd:sddone
isp:isp_i_call_pkt_comp
isp:isp_intr
sbus:sbus_intr_wrapper
unix:intr_thread
genunix:rmalloc
sbus:getdvmapages
sbus:iommu_dma_setup
sbus:iommu_dma_bindhdl
genunix:ddi_dma_buf_bind_handle
isp:isp_scsi_init_pkt
scsi:scsi_init_pkt
sd:make_sd_cmd
sd:sdstart
sd:sdstrategy
genunix:bdev_strategy
md:md_stripe_strategy
md_mirror:mirror_read_strategy
md:mdstrategy
genunix:bdev_strategy
specfs:spec_startio
specfs:spec_pageio
genunix:swap_getapage
genunix:swap_getpage
genunix:anon_getpage
genunix:segvn_faultpage
genunix:segvn_fault
genunix:as_fault
unix:pagefault
unix:trap
unix:flush_user_windows_to_stack
genunix:savecontext
genunix:setcontext
genunix:syscall_ap

Here's the story (read the stack from the bottom up):  A user
thread performs a setcontext(2) system call.  While handling
this system call we trap for a pagefault (we've tried to
access a page not presently in memory but can be paged in).
In swap_getapage() we start the I/O to retrieve the page
from the swap device.  The swap device is mirrored using
DiskSuite (md modules) so we need to pass through some
DiskSuite layers (eg, to decide which side of the mirror
to read from this time).  This done, the sd (scsi disk)
driver prepares a SCSI command packet and submits it.
This disk resides on a SCSI bus controlled by an isp
controller, so the isp driver also contributes.  At this
point the current thread is pinned by an interrupt thread
(this hijacks the current thread for a moment and borrows
its stack).  The interrupt has been raised to notify the
completion of some I/O, and some completion processing
is performed.  Just at the point that the interrupt thread
is about to passivate and unpin the current thread, we finally
make one stack frame request too many and we fault (trap)
and panic.

Because setcontext() has used an unreasonable amount of stack
space very early on, there was no room for all the processing
that followed.

Again, this was a user thread that performed a system call - so
increasing lwp_default_stksize would likely have avoided the
stack overflow and the subsequent panic.  The excessive stack usage
on the part of setcontext() was corrected in 2.5.1 patch 103640-20
and 2.6 patch 105181-05.

3.3.3 RPC svc_run

The svc_run kernel threads handle kernel RPC requests, such as
NFS.  Very often layered drivers will come into play with
these service threads as requests are made to the filesystem.
At one point (corrected now in kernel update patches) the
svc_run function itself was a little greedy in terms of
local variable usage (and therefore its stack frame size).
Because of the layered drivers that commonly appeared in 
svc_run thread stacks (see below) this extra demand from
svc_run proved to be just enough to cause stack overflows.

Here is a typical stack from an overflow in a svc_run thread:

sd:make_sd_cmd
sd:sdstart
sd:sdstrategy
genunix:bdev_strategy
rdriver:rd_strategy
genunix:bdev_strategy
vxdmp:dmpstrategy
genunix:bdev_strategy
vxio:voldiskiostart
vxio:volkiostart
vxio:vxiostrategy
genunix:bdev_strategy
vxfs:vx_ldlog_write
vxfs:vx_logbuf_write
vxfs:vx_logbuf_io
vxfs:vx_logbuf_flush
vxfs:vx_logflush
vxfs:vx_tranlogflush
vxfs:vx_trancheck
vxfs:vx_mapstrategy
vxfs:vx_bc_bwrite
vxfs:vx_bwrite_cmn
vxfs:vx_unlockmap
vxfs:vx_extprevfind
vxfs:vx_extentalloc
vxfs:vx_do_bmap_typed
vxfs:vx_do_bmap_typed
vxfs:vx_do_bmap_typed
vxfs:vx_bmap_typed
vxfs:vx_bmap
vxfs:vx_get_alloc
vxfs:vx_alloc_getpage
vxfs:vx_do_getpage
vxfs:vx_write_alloc
vxfs:vx_write1
vxfs:vx_write
nfssrv:rfs3_write
nfssrv:rfs_dispatch
rpcmod:svc_getreq
rpcmod:svc_run

An RPC request has arrived and been decoded as an NFS version 3 write
request.  We perform a write operation on the associated vnode, which
happens to reside in a VxFS filesystem.  We therefore have to go
through the VxFS layer before any real I/O request is actually made.
Even then, the filesystem resides on a VxVM managed volume and so
the vxio layer gets involved.  It's not over yet - the Veritas
DMP layer gets to make some decisions and then because the physical
disk is in a hardware RAID system we pass through the rdriver (rm6)
before finally starting the real I/O request in the sd driver.

In this case, VxFS (version 3.2.2) was also on the greedy side in terms
of stack usage.  Subsequent versions of VxFS trimmed down their local
variable usage to be kinder to the stack.  VxFS documentation now also
suggests modifying the stack size for svc_run threads (see below).

Even if neither the svc_run function itself nor any of the VxFS functions
above had been at all greedy in stack usage, the above is always going to
be a reasonably deep stack.  This is why the VxFS documentation 
recommends tuning the svc_run stacksize (as below).

In releases 2.5.1 and later (for 2.5.1 you'll need kernel update
103640-28 or later) it is possible to tune the svc_run stacksize.
This is done with

	set rpcmod:svc_run_stksize=0x4000

in /etc/system (0x4000 is 16K in hex, an increase from the
usual value of 8K).
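
You can check the value actually in use on a running system (as root)
in the same way as the script in section 4.1.2 reads these variables,
by feeding the query to adb, which prints the current value in decimal:

	echo "svc_run_stksize/D" | adb -k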

The svc_run threads are pure kernel threads - they do not support
user level threads directly.  As such, the lwp_default_stksize
variable has no effect on them.

3.3.4 Pagefault Handling

A pagefault occurs when we try to access a virtual memory
address which is valid but not currently loaded (or "mapped")
into physical memory.  The pagefault handler retrieves the
required page that includes the desired memory address
from some file or from the swap devices.  Clearly this
is a case likely to involve layered drivers (DiskSuite,
VxFS, VxVM etc).

Pagefaults can occur at almost all stages of execution.  For example:

 . in user applications not currently running in kernel mode (ie, outside
   of a system call)

 . during system call processing

 . in pure kernel threads, such as svc_run threads


When layered products (VxFS, VxVM, DiskSuite, disk hardware drivers, ...)
are present, handling a pagefault is likely to involve passing through
many of these layers.  It's even possible that we'll have to handle
a pagefault when already well down into the layered products (eg,
handling an NFS read request in svc_run where the requested page
is not presently in memory) in which case many of the layered
products could appear twice in our stack.  Even if no layer is
at all guilty of excessive stack usage, in extreme cases it can all
add up to cause a stack overflow.

One way of avoiding the possibility of stack overflow during pagefault
handling is to increase stack sizes.  Since pagefaults are not limited
to any one type of kernel thread, you need to modify both
lwp_default_stksize and rpcmod:svc_run_stksize (see the /etc/system
example in section 4.1.3).

3.3.5 Recursion in prrealvp

In this bug the prrealvp() kernel function called itself
recursively with no possible termination condition - infinite
recursion.  After around 40 iterations the stack would be exhausted
and the system would panic (or watchdog reset on non-sun4u architectures).
An example stack (from a WebNFS operation) was:

...
procfs:prrealvp
procfs:prrealvp
procfs:prrealvp
procfs:prrealvp
nfssrv:rfs_publicfh_mclookup
nfssrv:rfs3_lookup
nfssrv:rfs_dispatch
rpcmod:svc_getreq
rpcmod:svc_run

This was fixed in Solaris 2.6 kernel update 105181-01.

3.4 Mapping the Red Zone
------------------------

Solaris 2.6 and 32-bit Solaris 7 and later kernels (but not 64-bit kernels)
contain a mechanism designed to avoid stack overflows associated with
pagefault handling.  This is a bit of an uglyism that comes about
because of the limited kernel address space in a 32-bit kernel
(4GB does not go a long way these days) - that's why the 64-bit
kernels do not include this mechanism.

At the top of the pagefault function a check is made to see
how much room is left on the current stack.  If there is
less than red_minavail (a kernel variable tuneable in /etc/system,
default value 5000 bytes) remaining then we temporarily
extend the stack for this thread to include its redzone page
(gaining an extra 8K of stack space on sun4u).  At the end
of pagefault handling we unmap the redzone page again.

The difficulty with mapping the redzone of a thread's stack
is that we no longer have a redzone!  If, for example, the
thread was recursing more deeply than allowed then it's possible
that it would rapidly consume even the temporarily extended
stack and, once beyond that, might be corrupting the stack
page for some innocent thread.  But since most overflows
only overrun the stack by a small amount,
the technique is quite successful.

The other weakness with this approach is that sometimes we might
fall only marginally on the good side of red_minavail (eg,
5004 bytes remaining with red_minavail = 5000) but still
require more stack than remains for handling the pagefault.
In this case we would not map the redzone, and would
suffer a stack overflow.

A solution is to increase red_minavail, say to 7000.  This
value would actually cause the redzone to be mapped on
*every* pagefault (the stack page is usually 8K but 1K of that
can be used for some thread data and the space available
for stack is often only 7K).  A value of 6000 would avoid
mapping the redzone every time, but we'd still map it more
often and there will be a small performance penalty (each time
we need to find a page that can back the redzone and then
map the redzone to this page).  It is better (on systems with
enough memory) to increase default stack sizes instead (using
lwp_default_stksize and rpcmod:svc_run_stksize).

4.1 Preventing Kernel Stack Overflows in Solaris
------------------------------------------------

4.1.1 Kernel Parameters Available for Tuning

lwp_default_stksize
	This is the stacksize used for kernel threads that support
	user processes.  Typically we find that most kernel threads
	in a system are of this nature (because every process has at least
	one support kernel thread).  Tuning this can assist with:

	. stack overflows occurring during system call processing (eg,
	  a pagefault that needs to be handled during a system call)

	. stack overflows that could occur while handling a pagefault
	  from a user process not running in kernel mode (these are
	  rare)

	. stack overflows resulting from an interrupt borrowing the
	  stack of an LWP it pins

	Default value is 8K on 32-bit kernels and 16K on 64-bit kernels.

rpcmod:svc_run_stksize
	This is the stack size used for kernel threads that service
	RPC calls to the kernel (NFS being the prime example).
	These threads are often subject to multiple product code
	layers and pagefaults are often handled during processing.
	If a system is using VxFS, VxVM, DiskSuite etc (and especially
	if there are also dedicated disk product drivers such as rm6,
	fca, etc) then it may be worth increasing this value.
	NFS servers using layered products would be well advised
	to increase this value.

	Default value is 0 - which means we will use the value for
	_defaultstksz unless we have tuned svc_run_stksize in /etc/system.

	In releases 2.5.1 and later (for 2.5.1 you'll need kernel update
	103640-28 or later) it is possible to tune svc_run stacksize.

_defaultstksz
	This is the stack size allocated when no size is specified
	in the function call to create a new kernel thread.
	This value CANNOT be tuned in /etc/system (see bug 4025675).
	This value does not overrule lwp_default_stksize or svc_run_stksize
	(ie, it is a default value when none is specified, it is not
	a minimum stacksize specification).

	Not being able to tune _defaultstksz in /etc/system is not as
	much of a disadvantage as one might think - the stack sizes
	that can't be influenced through use of lwp_default_stksize
	and svc_run_stksize are those of the remaining pure kernel threads,
	which have not been known to suffer stack overflows.

red_minavail
	See section 3.4 for a full discussion of this parameter.
	On systems with adequate amounts of memory it is preferable to
	tune lwp_default_stksize and rpcmod:svc_run_stksize rather than
	modify red_minavail.  Mapping redzones is only ever attempted
	on 32-bit kernels, and also applies *only* to pagefault handling
	(so would not help with stack overflows from other causes).
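
	If, despite the above, you do decide to tune red_minavail, the
	setting is made with a line such as the following in /etc/system
	(6000 being the sort of value discussed in section 3.4):

		set red_minavail=6000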

4.1.2 Consequences of Tuning Stack Sizes

Increasing default stack sizes will, of course, increase the
amount of memory consumed by the kernel.  If you have a considerable
amount of memory in your system then chances are that you can ignore
this.  But some systems have very large numbers of kernel threads -
for example systems that have a very large number of processes running
at one time (eg, if a new process is started for every user that connects,
or if users actually login to the system via telnet or rlogin) or
a large number of pure kernel threads (systems with hundreds of
individual filesystems, NFS servers serving a very high demand) -
and the extra memory for every thread stack can mount up to
something considerable.

To retrieve the parameters that we're interested in for assessing
the memory impact of increasing stack sizes, run the following
script as root on your server during typical/peak production:

___ cut here ___
#!/bin/sh
# Report the page size, the stack size tuneables and the current number
# of kernel threads (via adb -k), then estimate total kernel memory
# usage (via sar -k).

# On a 64-bit kernel, print _defaultstksz as a 64-bit decimal value.
loi="D"
if [ -x /usr/bin/isainfo ]; then
        if [ "`/usr/bin/isainfo -b`" = "64" ]; then
                loi="E"
        fi
fi

adb -k <<EOM
pagesize/40tD"<---"nn
_defaultstksz/40t${loi}"<---"nn
lwp_default_stksize/40tD"<---"nn
svc_run_stksize/40tD"<---"nn
nthread/40tD"<---"nn
EOM

if [ -f /usr/sbin/sar ]; then
        junk="`/usr/sbin/sar -k 1 1 | tail -1`"
        smlmem=`echo $junk | cut -d' ' -f2`
        lgmem=`echo $junk | cut -d' ' -f5`
        ovsz=`echo $junk | cut -d' ' -f8`
        total=`expr $smlmem + $lgmem + $ovsz`
        mb=`expr $total / 1024 / 1024`
        echo "Estimated kernel memory usage: ${mb}MB"
else
        echo "Kernel usage: sar not installed (package SUNWaccu)"
fi
___ cut here ___

Here's a sample output:

___
physmem 7942
pagesize:
pagesize:                               8192            <---

_defaultstksz:
_defaultstksz:                          8192            <---

lwp_default_stksize:
lwp_default_stksize:                    8192            <---

svc_run_stksize:
svc_run_stksize:                        0               <---

nthread:
nthread:                                166             <---


Estimated kernel memory usage: 29MB
___

The physmem value is in hex, all other values are decimal.

This system has 0x7942 (31042) pages of memory available to work
with.  Each page is 8192 bytes, so this amounts to 242MB of memory
(the system has 256MB of physical memory).  The kernel is using
approximately 30MB of memory (as a rough guide, 32-bit kernels
use 10-15% of physical memory; for 64-bit kernels this is increased
to 15-20%).  This system has a 32-bit kernel.

There are 166 kernel threads present in the system.  This is a small
number - the number of threads in a kernel ranges from a couple of
hundred to several thousand in current configurations.

All the stack size tuneables are at their default values.  To estimate
the amount of memory currently used for kernel stacks, we can multiply
the number of threads present by the typical stack size (this is just
a guide - as mentioned above some threads such as STREAMS threads
are created with a specified stack size not controlled by the above tuneables).
This system is using around 166 * 8192 = 1359872 bytes, or 1.3MB,
of physical memory for kernel stacks.  We could easily increase
lwp_default_stksize and svc_run_stksize to 16K and not miss the
additional memory consumed.

But consider the case where the system typically has 10000 kernel threads.
With (default) 8K stacks as the typical stack size (without tuning)
this would use around 78MB of physical memory.  Increasing typical
stack size to 16K would double this figure - some systems might
miss the additional 78MB consumed.
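
To redo this arithmetic for your own system, multiply the nthread value
reported by the script above by the stack size you intend to use.  For
the 10000-thread example with 16K stacks:

	expr 10000 \* 16384

prints 163840000 bytes, or roughly 156MB.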

4.1.3 Recommendations

On 64-bit Solaris kernels you should not have to perform any
tuning for kernel stack sizes.  If a stack overflow is suffered
it is most likely due to a severe problem with some kernel module
or driver (eg, infinite recursion) and the correct path is to
fix or eliminate this component.

For 32-bit Solaris kernels (releases 2.6 and earlier, or Solaris 7
and later booted 32-bit - 'isainfo -b' will tell you how you've booted)
some precautions may be necessary.

Firstly, be sure to avoid known bugs that cause stack overflows (see
section 3.3):

 . for Solaris 2.5.1:
	- 103640-20 fixed the setcontext() bug mentioned in section 3.3
	- 103640-28 enables the tuning of rpcmod:svc_run_stksize

 . for Solaris 2.6:
	- 105181-01 fixed the prrealvp() bug
	- 105181-06 fixed the setcontext() bug

 . apply the M64 (ati) driver patch (see section 3.3.1) if you have this
   graphics framebuffer installed (Ultra 5, Ultra 10)

 . upgrade VxFS if you're still running version 3.2.2

Next consider whether you might need to increase default stack sizes:

 . if you're using a few of the "layered products" such as DiskSuite,
   VxFS, VxVM (particularly if also using disks that employ
   additional drivers such as rm6, fca) then increasing default
   stack sizes may be wise if you have sufficient memory (see calculations
   above).  VxFS documentation advises increasing svc_run_stksize.
   This will help with kernel RPC, but increasing lwp_default_stksize
   in addition to increasing svc_run_stksize will help with more cases.

 . check the memory implications of increasing stack sizes (as above)

 . do NOT try to tune _defaultstksz unless/until bug 4025675 is fixed
   (this includes trying methods other than /etc/system)

 . increasing stack sizes above 16K is a wholesale waste of memory

 . if you decide to increase stack sizes, increase both lwp_default_stksize
   AND svc_run_stksize - to do only one would be a half measure

 . it is preferable to modify lwp_default_stksize and svc_run_stksize rather
   than just tuning red_minavail; it doesn't make much sense to tune all
   three parameters.  If you decide to tune red_minavail (default value 5000)
   then don't change the other two and use values no more than 7000 (which
   will force a redzone mapping on *every* pagefault and have a small
   performance impact).

To tune svc_run_stksize and lwp_default_stksize add the following lines
to /etc/system:

___ cut here ___
* increase lwp and svc_run stack sizes to 16K
set lwp_default_stksize=0x4000
set rpcmod:svc_run_stksize=0x4000
___ cut here ___

You'll need to reboot for these to take effect.  For Solaris 2.5.1
you will require 103640-28 before you can successfully tune
svc_run_stksize.
Product Area Kernel
Product Config
OS any
Hardware n/a

Sun Proprietary/Confidential: Internal Use Only