SC13 Denver, CO

The International Conference for High Performance Computing, Networking, Storage and Analysis

A Transactional Model for Fault-Tolerant MPI for Petascale and Exascale Systems


Authors: Amin Hassani (University of Alabama at Birmingham), Anthony Skjellum (University of Alabama at Birmingham), Ron Brightwell (Sandia National Laboratories)

Abstract: Fault-Aware MPI (FA-MPI) is a novel approach to providing fault tolerance through a set of extensions to the MPI Standard. It employs a transactional model to address failure detection, isolation, mitigation, and recovery via application-driven policies. This approach allows applications to employ different fault-tolerance techniques, such as algorithm-based fault tolerance (ABFT) and multi-level checkpoint/restart methods. The goal of FA-MPI is to support fault awareness in MPI objects and to enable applications to run to completion with higher probability than they would on a non-fault-aware MPI. FA-MPI combines non-blocking communication operations with a set of TryBlock API extensions that can be nested to support multi-level failure detection and recovery. Keeping overhead low during fault-free execution is also a key concern.
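
The nested try-block idea in the abstract can be illustrated with a small toy model. The sketch below is not the FA-MPI API and makes no MPI calls. The names `TryBlock`, `post`, `close`, `OperationFailed`, and the policy callbacks are hypothetical, chosen only to show the control flow: operations are posted without blocking, failures are detected when the block closes, an application-supplied policy decides whether the block can recover, and unrecovered failures escalate to the enclosing block.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


class OperationFailed(Exception):
    """Stands in for a fault detected while completing an operation."""


@dataclass
class TryBlock:
    """Toy transaction (hypothetical, not FA-MPI): defers work, checks it at close."""
    name: str
    parent: Optional["TryBlock"] = None
    pending: List[Callable[[], None]] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)

    def post(self, op: Callable[[], None]) -> None:
        """Register an operation without running it (mimics a non-blocking post)."""
        self.pending.append(op)

    def close(self, recover: Optional[Callable[["TryBlock"], bool]] = None) -> bool:
        """Complete pending operations; return True if the block ends clean."""
        for op in self.pending:
            try:
                op()
            except OperationFailed as exc:
                # Detection: record the failure instead of aborting the block.
                self.failures.append(f"{self.name}: {exc}")
        self.pending.clear()
        if not self.failures:
            return True
        # Recovery: the application-supplied policy decides what to do.
        if recover is not None and recover(self):
            self.failures.clear()
            return True
        # Escalation: hand unrecovered failures to the enclosing block.
        if self.parent is not None:
            self.parent.failures.extend(self.failures)
        return False


def ok_send() -> None:
    """Simulated operation that completes normally."""


def failing_send() -> None:
    """Simulated operation that hits a fault."""
    raise OperationFailed("peer unreachable")


def run_demo():
    log: List[str] = []
    outer = TryBlock("outer")
    inner = TryBlock("inner", parent=outer)
    inner.post(ok_send)
    inner.post(failing_send)

    def inner_policy(block: TryBlock) -> bool:
        log.append("inner: cannot repair locally")
        return False

    def outer_policy(block: TryBlock) -> bool:
        log.append("outer: rolled back to checkpoint")
        return True

    inner_ok = inner.close(recover=inner_policy)
    outer_ok = outer.close(recover=outer_policy)
    return inner_ok, outer_ok, log
```

In `run_demo`, the inner block fails and its policy declines to recover, so the failure is passed to the outer block, whose policy (standing in for something like a checkpoint restore) resolves it. A real implementation would complete actual non-blocking MPI requests rather than calling Python functions.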

Poster: pdf
Two-page extended abstract: pdf

