E-cache FAQ
Friday, 15 September 2000

Q1: What is E-Cache?
A1: E-cache is an abbreviation for external cache. External cache is also sometimes referred to as L2 or Level 2 cache. On the microprocessor die there is the L1 (Level 1, or on-board) cache, which is in the kilobyte range in size. E-cache is external to, or separate from, the microprocessor itself and physically sits on the same module or daughter board. On our UltraSPARC processors we have different e-cache sizes, with the two most popular being 4MB and 8MB on our E3500 to Starfire systems.

Q2: What is an E-Cache parity error?
A2: Any time information is moved on a computer system, it is important to have the appropriate data protection in place to know that the data has made the trip and no bits have been altered along the way. Between the UltraSPARC processor and the e-cache we use parity protection. This means that each time we move data to/from the e-cache we check to make sure no bits have flipped from one to zero or vice versa.

If we determine that a bit has been flipped, then *old* Solaris (prior to the kernel update, which is described in Q4) lets the application know immediately by issuing a panic. This *old* version of Solaris would then initiate a system reboot, with the appropriate error information going to log files. The system then does a Power On Self Test to determine the health of each component. Properly written applications provide an appropriate mechanism to deal with sudden reboots such as those an e-cache parity error may cause. It is very important to remember, and to tell our customers, that we have not corrupted any data in properly written applications because of this.

The *new* Solaris (this is what is referred to as the kernel update, or sometimes the kernel jumbo patch, or sometimes the kernel level cache scrubber) is GREATLY improved in dealing with e-cache parity errors.

IT IS CRITICAL THAT ALL SUN EMPLOYEES:
- UNDERSTAND THE KERNEL UPDATE AND WHAT IT DOES.
- REALIZE THAT EDUCATING OUR CUSTOMERS ON THE KERNEL UPDATE AND GETTING IT INSTALLED ON OUR CUSTOMERS' SYSTEMS IS THE MOST IMPORTANT ACTION WE CAN TAKE TODAY FOR OUR 2.5.1 & 2.6 CUSTOMERS
  (Solaris 7 & 8 kernel update will be coming in the next 6 weeks)
  (Solaris 7 & 8 user level cache scrubber is available today)
- WORK WITH YOUR E-CACHE EXPERTS TO DETERMINE IF YOUR SOLARIS 7 & 8 CUSTOMER SHOULD USE THE USER-LEVEL CACHE SCRUBBER

Q3: What causes e-cache parity errors?
A3: E-cache parity errors can be the result of an intermittent signal integrity issue (noise). Noise can exist in the circuitry that connects the UltraSPARC-II CPU with the Data Buffer and the e-cache. Another contributing factor is ionizing radiation that occurs naturally in the environment. The density of the e-cache is high enough that if the e-cache encounters one of these high energy particles, a bit in the e-cache can be changed (flipped) from a zero to a one or vice versa.

Either of the above can cause the e-cache data and parity to no longer match. The result is an "e-cache parity error panic", which keeps data corruption from migrating throughout the system. When these errors are transient, which is normally the case, they are classified as "soft errors."

We have also found that environmental conditions in the data center can sometimes make this problem worse. Having the correct temperature, humidity and cooling is very important for any computer system. In some cases we have recommended physical audits to ensure the right environmental conditions.

Q4: What is the e-cache scrubber or the kernel update or the kernel jumbo patch - what is the difference?
A4: The e-cache scrubber is a program that is designed to flush the cache more often to better protect the integrity of the data. The first version of the e-cache scrubber was a user level program, that is, a program that could be started at the command line like any other program. It would be started up by the system at boot time.
It did not modify the kernel. A small set of customers and servers at Sun have realized substantial reductions in the number of e-cache parity errors by using the user-level e-cache scrubber program.

The second and greatly enhanced revision of this concept is called the kernel update or kernel jumbo patch - people refer to it both ways. The e-cache scrubber concept was moved into the kernel and was enhanced to provide much better error handling and error messaging. It was only by moving it into the Solaris kernel that we were able to realize these additional benefits.

WE STRONGLY RECOMMEND THAT OUR CUSTOMERS INSTALL THIS KERNEL UPDATE ON THEIR SYSTEMS AS SOON AS THE APPROPRIATE RELEASE IS AVAILABLE.

http://sunsolve.Ebay.Sun.COM/cgi/retrieve.pl?doc=fins%2FI0616-1&zone_32=I0616-1
(the pointer above is to the kernel update for 2.5.1 & 2.6)

For those of you who want/need a more technical description - please read the information at the pointer above and it should answer your questions. If it does not - see the list of E-Cache experts to call that is listed on the E-Cache Internal Home Page (where you just were).

The greatest benefit to customers is the cache-scrubbing mechanism. This is part of the 2.5.1 and 2.6 kernel updates. 2.6, Solaris 7 and Solaris 8 have the additional benefit of better error handling and error messaging. For technical reasons, it was not possible to provide the better error handling and error messaging with 2.5.1. Keep in mind that we still highly recommend the kernel update for 2.5.1 because of the cache scrubbing benefits that become part of the kernel when the kernel update is installed.

Q5: If my customer does not currently have the new 5762 module, what should I do?
A5:
1) Install the appropriate Kernel Update Patch as soon as possible.
2) Follow Best Practices.
3) As per Masood Jabbar and Larry Hambley's e-mail:
"Note: The efficiency of the software is greatest when combined with systems based on the newer 5762 components.
In keeping with our goal of 100% customer satisfaction, we are willing to replace all older 5661 components with the newer, more reliable 5762 components upon customer request. This process will take from four to eight months to complete as manufacturing of the newer component ramps up. We will be working first with customers whose systems have been affected. Enterprise Services and GSO will be working to create an account specific replacement strategy and plan for affected systems. Start working now with your service and SE teams to begin creating your account migration plans. Please help affected customers prioritize the mission criticality of their affected systems for scheduling the replacement."

Q6: Is this the same as the famous Intel Pentium bug?
A6: No, this is very different from the infamous Intel FP division bug that was found in the fall of 94. The Intel FP division bug produced WRONG answers. An e-cache parity error does not corrupt data by coming up with an incorrect answer. When the extremely rare e-cache parity error occurs, Solaris takes the correct action, which is to issue a panic to the system. We have a kernel update that will be able to minimize some of these panics.

Q7: I think my customer believes we should have ZERO processor errors - is this realistic?
A7: This is a very important point for all of us to understand. It is not realistic to expect zero problems. We need to make sure we have the proper system for the availability needs of our customers.

Q8: ES talks about a Best Practices Guide - what is this?
A8: The Best Practices Guide is a document that was put together to address key issues proven to affect E3000 to E10000 system reliability, with particular emphasis on the UltraSPARC CPU module. Initially this was internal only and used primarily by Enterprise Services. Today, the internal document has been reworked into a customer-ready document which also serves as the Best Practices Guide for the field.
The internal tools and URLs that were mentioned in the original "internal" document have been removed from the customer-ready document and put in an internal document titled "Sun Resource Appendix." Both documents can be found at:
http://bestpractices.central/
The guide was developed to capture the best processes that we have seen worldwide. It sets a benchmark for excellence in lifetime management of products.

Q9: What is this about a mirrored or shadow e-cache module that I keep hearing about?
A9: First, it is very important to know that we believe that the kernel update for Solaris achieves or exceeds Sun's original reliability objectives. Please read Q4 on what a cache scrubber is if you have not done so. Mirrored e-cache is an example of the continuous incremental improvements that we have been making to our processor modules since day one. Please remember that most of Sun's customers are not having any e-cache issues. The limited number that are affected by e-cache issues will, in all probability, be better served by the kernel update than by disrupting their systems.

This summer, initial samples of this HW component technology - mirrored SRAM - became available to Sun's engineering team. Laboratory tests indicated that the re-engineered SRAMs further improve availability. Availability of this new SRAM, limited initially, begins this October. The limited quantities will be directed towards those accounts that have implemented best practices and the kernel scrubber and are still seeing unacceptable levels of system availability. Volume production (GA - General Availability) will be in the Spring of 2001.

Mirrored e-cache is much like mirroring disk drives. We have put one 16MB SRAM (Static Random Access Memory) on one processor module. To the application it appears exactly the same as before - 8MB. To Solaris it allows us to split the 16MB SRAM into two 8MB mirrors. This allows us to write to both mirrors and read from one mirror.
If an e-cache parity error occurs, we simply read from the other 8MB mirror and correct the bad copy. The odds of an e-cache error occurring on the same cache line in both mirrors at the same instant are close to zero. The installation of this CPU module requires that the kernel update be installed. It is important to note that the kernel update with 5762 modules already improves system reliability significantly, in many cases eliminating the problem.

Q10: Can I mix and match mirrored SRAM modules with non-mirrored?
A10: First of all, before mirrored SRAMs are appropriate for consideration, the affected systems should have the 5762 modules and the kernel update installed. Only if those have not proved effective should the mirrored SRAM module be considered. Technically, mixing and matching mirrored SRAM modules with non-mirrored can be done; however, we are not recommending it. The reason we do not want to mix and match is total customer downtime. We believe that for those rare systems which are not substantially improved by the Kernel Level Scrubber and 5762 modules, it is in our customers' best interests if we simply replace all the CPU modules when the situation requires mirrored SRAMs.

Q11: Do our FT systems have E-Cache problems?
A11: Our FT systems are designed very differently from our standard commercial servers. Sun's FT systems are designed for specialized markets such as Telco switching equipment. An e-cache parity error in a Sun FT system would not cause a system panic.

Q12: The Solaris 7 & 8 kernel updates are not available yet - should I install the user level scrubber on systems that may be affected?
A12: User Level Scrubber v3 (for Solaris 7 & 8) is available only upon request: contact Joe LaVery or Roger Koos. This code is not yet available on SunSolve. Migrate to the Kernel Level Scrubber as soon as it becomes available.

Q13: We are telling the field to migrate our customers to Solaris 8 - but the kernel update is not available for 6 weeks - why?
A13: We had a tough decision to make, and the best solution was to provide the kernel update first for the versions of Solaris most widely used by our customers today. Those versions are Solaris 2.6 and 2.5.1. Expect kernel updates for Solaris 7 and Solaris 8 in six weeks.

Q14: I hear some talk about alpha-particles - what is this and does it affect our systems?
A14: Alpha particles are randomly generated, positively charged, energetic nuclear particles originating from either of two sources:
- extraterrestrial cosmic rays, which come from outer space and constantly bombard the earth
- the decay of naturally occurring radioisotopes like Radon, Thorium and Uranium.
Alpha particles can cause what is known as an SEU (Single Event Upset), a type of soft error. This can cause an e-cache parity error to occur. Alpha particles do not cause permanent damage, but they can cause a bit to be flipped. You need to be very careful with this information because you can quickly get into discussions about issues that may not be your area of expertise. What is important to remember is that we are doing a number of things to address this, such as checking the environmental conditions of the data center, the cache scrubber, the kernel update, CPU and board shrouding (a method of protecting electronic components from static & EMI), new 5762 modules and mirrored SRAMs.

Q15: If my customer paid for Platinum and SRS - would we have caught these problems ahead of time?
A15: What we have learned is that proactive monitoring of systems can be absolutely critical for the success of a data center. At customer sites where we have come in and proactively started monitoring the systems and the data center, we have made tremendous improvements in reducing not only e-cache parity errors but other system concerns as well.

Q16: We talk about the 14 A&Q initiatives - I have not seen specific metrics on success or updates - what is going on with these initiatives?
A16: For more detail on Sun's Availability and Quality Initiatives please see:
http://aq.eng/cgi-bin/view?doc=183

Q17: My customer is now questioning all aspects of error checking on our systems - what should I do?
A17: This reaction is to be expected. Please work with your Director of Technology to get the right Engineer in front of the customer to explain the e-cache situation as well as the RAS characteristics of our systems.

Q18: Can SyMon or SunMC detect E-Cache errors?
A18: Today SunMC or SyMon cannot detect e-cache parity errors. You would still be going through log files after a system panic and reboot.

Q19: Do we refer to this as a design defect?
A19: Absolutely not. A design defect can be isolated and the problem can be duplicated. We are dealing with a transient vulnerability in a specific component. We have found that 9x% of all modules which fail with an e-cache error do not fail a second time. This is why we do not recommend swapping on first failure.

Q20: How has this changed our design methodology?
A20: We have made significant investments in numerous areas. Please see:
http://aq.eng/cgi-bin/view?doc=183

Q21: Why not just provide an upgrade program from our existing systems to our FT systems - wouldn't this solve the problem?
A21: As was mentioned earlier, our FT systems are not designed for general purpose computing. FT systems do not have the number of processors, amount of storage, etc. that is typically needed for commercial environments. In addition, customers are unwilling to pay for availability features that they will not need.

Q22: When we do a POST - Power On Self Test - can we turn on any switches or options so the system could check for E$ errors?
A22: POST runs extensive testing of all components and does check the processors. The challenge with e-cache parity errors is that the problem is transient. We have added additional e-cache testing to our initial screening process in manufacturing.

Q23: What about future UltraSPARC II processors?
Are they designed differently to protect against E$ problems?
A23: The next, and final, speed improvement for the UltraSPARC II processor is in the final design stages at this time. Circuit changes will be made to minimize noise even further and to deal with signal integrity problems that might affect SRAM. The testing phase of this processor module will be extended to collect the data to prove that the resulting module has extremely high reliability.

Q24: What about the UltraSPARC-III -- does it have protection on the external e-cache?
A24: Yes. UltraSPARC-III has ECC protection on its external cache.
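
For readers who want a feel for the difference between parity protection (A2) and ECC protection (A24), the following is a minimal sketch only. It assumes a single even-parity bit for the parity case and a textbook Hamming(7,4) code for the ECC case. Neither is claimed to be the actual scheme used on any UltraSPARC module, and all function names are hypothetical. The point it illustrates is that parity can only detect a single flipped bit, while an error-correcting code can also locate and repair it.

```python
# Toy illustration only: NOT the actual UltraSPARC parity or ECC scheme.
# All names here are hypothetical and exist only for this sketch.

def parity_bit(bits):
    """Even parity: the extra bit that makes the total count of ones even."""
    return sum(bits) % 2

def parity_ok(bits, stored_parity):
    """True if the data still matches the parity bit stored with it."""
    return parity_bit(bits) == stored_parity

def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Repair at most one flipped bit; return (data_bits, error_position).

    error_position is 0 when no error was seen, otherwise the 1-based
    position of the bit that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position

if __name__ == "__main__":
    data = [1, 0, 1, 1]

    # Parity: a single flipped bit is detected, but cannot be located.
    stored = parity_bit(data)
    damaged = [1, 0, 0, 1]
    print("parity detects error:", not parity_ok(damaged, stored))

    # ECC: the same kind of single-bit flip is located and corrected.
    word = hamming74_encode(data)
    word[4] ^= 1  # simulate a soft error in one stored bit
    recovered, where = hamming74_decode(word)
    print("ECC recovered data:", recovered, "corrected position:", where)
```

With parity alone, the only safe response to a mismatch is to stop using the data, which is why a parity error surfaces as a panic. With an error-correcting code, a single-bit soft error can be repaired in place and the read can continue.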