SCHEDULE: NOV 16-22, 2013

Distributed Wait State Tracking for Runtime MPI Deadlock Detection

SESSION: MPI Performance and Debugging

EVENT TYPE: Papers

TIME: 1:30PM - 2:00PM

SESSION CHAIR: Thomas Fahringer

AUTHOR(S): Tobias Hilbrich, Bronis R. de Supinski, Wolfgang E. Nagel, Joachim Protze, Christel Baier, Matthias S. Mueller

ROOM: 201/203

ABSTRACT:
The widely used Message Passing Interface (MPI) with its multitude of communication functions is prone to usage errors. Runtime error detection tools aid in the removal of these errors. We develop MUST as one such tool that provides a wide variety of automatic correctness checks. Its correctness checks can be run in a distributed mode, except for its deadlock detection. This limitation applies to a wide range of tools that either use centralized detection algorithms or a timeout approach. In order to provide scalable and distributed deadlock detection with detailed insight into deadlock situations, we propose a model for MPI blocking conditions that we use to formulate a distributed algorithm. This algorithm implements scalable MPI deadlock detection in MUST. Stress tests at up to 4,096 processes demonstrate the scalability of our approach. Finally, overhead results for a complex benchmark suite demonstrate an average runtime increase of 34% at 2,048 processes.

Chair/Author Details:

Thomas Fahringer (Chair) - University of Innsbruck

Tobias Hilbrich - Technische Universität Dresden

Bronis R. de Supinski - Lawrence Livermore National Laboratory

Wolfgang E. Nagel - Technische Universität Dresden

Joachim Protze - RWTH Aachen University

Christel Baier - Technische Universität Dresden

Matthias S. Mueller - RWTH Aachen University

The full paper can be found in the ACM Digital Library.