SC13 SCHEDULE: NOV 16-22, 2013

SPBC: Leveraging the Characteristics of MPI HPC Applications for Scalable Checkpointing

SESSION: Fault-Tolerant Computing

EVENT TYPE: Papers

TIME: 2:00PM - 2:30PM

SESSION CHAIR: Pavan Balaji

AUTHOR(S): Thomas Ropars, Tatiana Martsinkevich, Amina Guermouche, André Schiper, Franck Cappello

ROOM: 201/203

ABSTRACT:
The high failure rate expected for future supercomputers requires the design of new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation between the events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, combines coordinated checkpointing and message logging in a hierarchical way. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.

Chair/Author Details:

Pavan Balaji (Chair) - Argonne National Laboratory

Thomas Ropars - Swiss Federal Institute of Technology Lausanne

Tatiana Martsinkevich - INRIA and University of Paris Sud

Amina Guermouche - Université de Versailles Saint-Quentin-en-Yvelines

André Schiper - Swiss Federal Institute of Technology Lausanne

Franck Cappello - Argonne National Laboratory

The full paper can be found in the ACM Digital Library