Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults

SESSION: Memory Resilience


TIME: 10:30AM - 11:00AM


AUTHOR(S): Vilas Sridharan, Jon Stearley, Nathan DeBardeleben, Sean Blanchard, Sudhanva Gurumurthi


Several recent publications confirm that faults are common in high-performance computing systems. Therefore, further attention to the faults experienced by such computing systems is warranted. In this paper, we present a study of DRAM and SRAM faults in large high-performance computing systems. Our goal is to understand the factors that influence faults in production settings. We examine the impact of aging on DRAM, finding a marked shift from permanent to transient faults in the first two years of DRAM lifetime. We examine the impact of DRAM vendor, finding that fault rates vary by more than 4x among vendors. We examine the physical location of faults in a DRAM device and in a data center; contrary to prior studies, we find no correlations with either. Finally, we study SRAM, finding that altitude has a substantial impact on SRAM faults and that top-of-rack placement correlates with a 20% higher fault rate.

Chair/Author Details:

Scott Pakin (Chair) - Los Alamos National Laboratory

Vilas Sridharan - Advanced Micro Devices, Inc.

Jon Stearley - Sandia National Laboratories

Nathan DeBardeleben - Los Alamos National Laboratory

Sean Blanchard - Los Alamos National Laboratory

Sudhanva Gurumurthi - Advanced Micro Devices, Inc.

