SC13 Denver, CO

The International Conference for High Performance Computing, Networking, Storage and Analysis

Optimal Placement of Retry-Based Fault Recovery Annotations in HPC Applications


Authors: Ignacio Laguna (Lawrence Livermore National Laboratory), Martin Schulz (Lawrence Livermore National Laboratory), Jeff Keasler (Lawrence Livermore National Laboratory), David Richards (Lawrence Livermore National Laboratory), Jim Belak (Lawrence Livermore National Laboratory)

Abstract: As larger HPC systems are built, fault recovery becomes a fundamental capability. Traditional fault recovery approaches, such as checkpointing, may not be sufficient for future exascale systems. Retry-based recovery techniques have been proposed as an alternative. These techniques simply re-execute a code region when a fault occurs, and they require code annotations that mark the regions to retry. However, no previous work has investigated the optimal placement of these annotations in a program. Using fault injection, we evaluate how to optimally place retry annotations in a hydrodynamics mini-application. We find that, contrary to our expectations, a simple scheme of protecting the main function works well for low fault rates: the slowdown is up to 1.25x at a rate of 3 faults/hour. We also find that the optimal recovery method is to roll back a few iterations in the application's main loop.
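The rollback idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or annotation interface: the names `run_with_rollback`, `SimulatedFault`, `rollback_depth`, and `faults_at` are hypothetical, the fault is injected by raising an exception before the step runs, and the "application" is any function that advances a state by one iteration. The sketch keeps snapshots of the last few iterations and, when a fault is detected, restores the oldest retained snapshot and re-executes from there.

```python
import copy
from collections import deque


class SimulatedFault(Exception):
    """Stands in for a detected fault (hypothetical; real detection is system-specific)."""


def run_with_rollback(state, step, num_iters, rollback_depth=2, faults_at=()):
    """Run `step` for `num_iters` iterations, rolling back on a fault.

    rollback_depth: number of completed iterations to undo when a fault occurs.
    faults_at: iteration indices at which one fault each is injected (assumption).
    Returns (final_state, number_of_step_executions).
    """
    pending_faults = set(faults_at)
    # Each entry is (iteration index, state at the start of that iteration).
    snapshots = deque(maxlen=rollback_depth + 1)
    executed = 0
    i = 0
    while i < num_iters:
        snapshots.append((i, copy.deepcopy(state)))
        try:
            if i in pending_faults:
                pending_faults.discard(i)  # inject each fault only once
                raise SimulatedFault(f"fault in iteration {i}")
            state = step(state)
            executed += 1
            i += 1
        except SimulatedFault:
            # Retry: restore the oldest retained snapshot and re-execute from it.
            i, state = snapshots[0]
            snapshots.clear()
    return state, executed


if __name__ == "__main__":
    final_state, executed = run_with_rollback(
        0, lambda s: s + 1, num_iters=10, rollback_depth=2, faults_at=[5]
    )
    print(final_state, executed)
```

In this toy setting, a single fault with `rollback_depth=2` costs two re-executed iterations, which gives a rough sense of how rollback depth and fault rate together determine slowdown. A real application would also need to account for the cost of saving state and for faults that corrupt state before they are detected.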

Poster: pdf
Two-page extended abstract: pdf

