SCHEDULE: NOV 16-22, 2013

A 'Cool' Way of Improving the Reliability of HPC Machines

SESSION: Energy Management


TIME: 1:30PM - 2:00PM


AUTHOR(S): Osman Sarood, Esteban Meneses, Laxmikant Kale


Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Reliability at exascale level could degrade to the point where failures become the norm. HPC researchers are focusing on improving existing fault tolerance protocols. Research on improving hardware reliability has also been making progress independently. In this paper, we try to bridge this gap and combine both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining processor temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. For a 350K-socket machine, regular checkpoint/restart fails to make progress, whereas our validated model predicts an efficiency of 20% by improving the machine reliability by 2.29X.
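
The doubling rule quoted in the abstract can be written as a small calculation. The sketch below is illustrative only and is not the authors' validated model: the function names, the assumption that sockets fail independently at identical rates, and any per-socket MTBF value passed in are hypothetical choices made for this example.

```python
import math

# From the abstract: fault rates double for every 10 degrees C rise in core temperature.
DOUBLING_STEP_C = 10.0


def relative_fault_rate(delta_t_c: float) -> float:
    """Fault-rate multiplier for a core temperature change of delta_t_c degrees C.

    Positive values mean hotter cores (more faults), negative values mean cooler cores.
    """
    return 2.0 ** (delta_t_c / DOUBLING_STEP_C)


def machine_mtbf(socket_mtbf_hours: float, sockets: int, delta_t_c: float = 0.0) -> float:
    """Mean time between failures of the whole machine, in hours.

    Assumption (not from the source): sockets fail independently at identical
    rates, so the machine fault rate is the per-socket rate times the socket count.
    """
    return socket_mtbf_hours / (sockets * relative_fault_rate(delta_t_c))


def cooling_for_gain(gain: float) -> float:
    """Temperature reduction (degrees C) that yields a given reliability gain
    under the doubling rule alone."""
    return DOUBLING_STEP_C * math.log2(gain)
```

For example, `relative_fault_rate(-10.0)` halves the fault rate, and `cooling_for_gain(2.29)` comes to roughly 12, meaning that under this rule alone a 2.29X gain would correspond to about 12°C of cooling. That figure is arithmetic on the quoted rule, not a temperature reported in the abstract.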

Chair/Author Details:

Taisuke Boku (Chair) - University of Tsukuba

Osman Sarood - University of Illinois at Urbana-Champaign

Esteban Meneses - University of Illinois at Urbana-Champaign

Laxmikant Kale - University of Illinois at Urbana-Champaign

The full paper can be found in the ACM Digital Library