The International Conference for High Performance Computing, Networking, Storage and Analysis
Making GMRES Resilient to Silent Data Corruption.
Authors: James J. Elliott (North Carolina State University), Mark Hoemmen (Sandia National Laboratories), Frank Mueller (North Carolina State University)
Abstract: Increasing parallelism and transistor density, along with increasingly tight energy and peak power constraints, may force exposure of occasionally incorrect computation or storage to application codes. Silent data corruption (SDC) will likely be infrequent, yet one SDC suffices to make numerical algorithms like iterative linear solvers cease progress towards the correct answer. Thus, we focus on resilience of the iterative linear solver GMRES to a single transient SDC. We derive inexpensive checks to detect the effects of an SDC in GMRES; these checks work for a more general SDC model than one that presumes a bit flip. Our experiments show that when GMRES is used as the inner solver of an inner-outer iteration, it can run through SDC of almost any magnitude in the computationally intensive orthogonalization phase. That is, it gets the right answer using faulty data without requiring any rollback. The SDCs that it cannot run through are caught by our detection scheme.
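The abstract does not spell out the checks, so the following is only an illustrative sketch of what an inexpensive check in the orthogonalization phase could look like, not the authors' implementation. It assumes a norm-bound test: in fault-free Arnoldi with orthonormal basis vectors, every coefficient satisfies |h_ij| <= ||A||_2 <= ||A||_F, so a value outside that bound signals corruption. The function names and the `corrupt` fault injector are hypothetical and exist only for this example.

```python
import math

def frobenius_norm(A):
    return math.sqrt(sum(x * x for row in A for x in row))

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def arnoldi_with_check(A, b, steps, corrupt=None):
    """Run Arnoldi (modified Gram-Schmidt) and flag suspicious coefficients.

    `corrupt` is a hypothetical fault injector (i, j, h) -> h' used to
    simulate an SDC in the orthogonalization phase.
    Returns a list of (i, j, value) entries that violate |h| <= ||A||_F.
    """
    bound = frobenius_norm(A)          # computed once, reused every step
    beta = math.sqrt(dot(b, b))
    Q = [[x / beta for x in b]]        # orthonormal basis vectors
    flagged = []
    for j in range(steps):
        w = matvec(A, Q[j])
        for i in range(j + 1):
            h = dot(Q[i], w)
            if corrupt is not None:
                h = corrupt(i, j, h)   # simulated fault
            if not (abs(h) <= bound):  # written this way so NaN is flagged too
                flagged.append((i, j, h))
            w = [wk - h * qk for wk, qk in zip(w, Q[i])]
        h_next = math.sqrt(dot(w, w))
        if not (h_next <= bound):
            flagged.append((j + 1, j, h_next))
        if h_next == 0.0:              # exact breakdown: Krylov space exhausted
            break
        Q.append([x / h_next for x in w])
    return flagged

# Small symmetric tridiagonal test matrix (an arbitrary choice for the demo).
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
b = [1.0, 0.0, 0.0, 0.0]

clean = arnoldi_with_check(A, b, steps=3)
faulty = arnoldi_with_check(
    A, b, steps=3,
    corrupt=lambda i, j, h: 1e6 if (i, j) == (0, 1) else h,
)
```

In this sketch the check costs one comparison per coefficient. A corrupted value that stays within the bound passes undetected, which is the case the abstract describes GMRES as being able to run through when used as an inner solver.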