SC13 Denver, CO

The International Conference for High Performance Computing, Networking, Storage and Analysis

Theory, Meet Practice: Challenges in Applying Failure Prediction on Large Systems

Student: Ana Gainaru (University of Illinois at Urbana-Champaign)
Supervisor: Marc Snir (Argonne National Laboratory)

Abstract: As supercomputers grow in size, so does the probability that a component fails within a given time frame. Checkpoint-restart, the classical method for surviving application failures, faces many challenges in the exascale era because rollbacks become both frequent and large. A complementary approach is failure avoidance, in which the occurrence of a fault is predicted and proactive measures are taken. With the growing complexity of extreme-scale supercomputers, predicting failures in real time becomes cumbersome and presents challenges not encountered before. This work is nearly complete and presents the key issues I encountered when applying online failure prediction on the Blue Waters system. The overhead of combining fault prediction and checkpointing on small- and large-scale systems will also be reported. The results give insight into the challenges of achieving an effective fault prevention mechanism for current and future HPC systems.

Poster: pdf
Two-page extended abstract: pdf
