SC13 Denver, CO

The International Conference for High Performance Computing, Networking, Storage and Analysis

Scalable Parallel Debugging via Loop-Aware Progress Dependence Analysis

Authors: Subrata Mitra (Purdue University), Ignacio Laguna (Lawrence Livermore National Laboratory), Dong H. Ahn (Lawrence Livermore National Laboratory), Todd Gamblin (Lawrence Livermore National Laboratory), Martin Schulz (Lawrence Livermore National Laboratory), Saurabh Bagchi (Purdue University)

Abstract: Debugging large-scale parallel applications is challenging, as it often requires extensive manual effort to isolate the origin of errors. In a scientific application whose tasks progress in a coordinated fashion, finding the tasks that have progressed the least can significantly reduce the time needed to isolate the root cause of many bugs. We present a novel run-time technique, loop-aware progress-dependence analysis, which improves the accuracy of identifying the least-progressed (LP) task(s). We extend AutomaDeD to detect LP task(s) even when the error arises in code with complex loop structures. Our evaluation shows that it accurately finds the LP task(s) for several hangs on which the baseline technique fails. During the poster session, we will begin with a case that illustrates some of the challenges in accurately analyzing progress dependencies of MPI tasks executing within a loop, and then present the key techniques needed to address these challenges.

Poster: pdf
Two-page extended abstract: pdf
