SCHEDULE: NOV 16-22, 2013
A 'Cool' Way of Improving the Reliability of HPC Machines
SESSION: Energy Management
EVENT TYPE: Papers
TIME: 1:30PM - 2:00PM
SESSION CHAIR: Taisuke Boku
AUTHOR(S): Osman Sarood, Esteban Meneses, Laxmikant Kale
Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Reliability at exascale level could degrade to the point where failures become the norm. HPC researchers are focusing on improving existing fault tolerance protocols. Research on improving hardware reliability has also been making progress independently. In this paper, we try to bridge this gap and combine both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining processor temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. For a 350K socket machine, regular checkpoint/restart fails to make progress, whereas our validated model predicts an efficiency of 20% by improving the machine reliability by 2.29X.
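The rule of thumb quoted in the abstract (fault rates double for every 10°C rise in core temperature) can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' model: the function names and the `doubling_step` parameter are hypothetical, and it assumes the doubling rule holds exactly as an exponential over the whole temperature range.

```python
import math


def relative_fault_rate(delta_t_celsius, doubling_step=10.0):
    """Fault-rate multiplier for a temperature change of delta_t_celsius.

    Assumes (illustratively) that the rate doubles for every
    `doubling_step` degrees Celsius of temperature rise.
    """
    return 2.0 ** (delta_t_celsius / doubling_step)


def relative_mtbf(delta_t_celsius, doubling_step=10.0):
    """Mean-time-between-failures multiplier: the inverse of the fault rate."""
    return 1.0 / relative_fault_rate(delta_t_celsius, doubling_step)


def cooling_needed(reliability_gain, doubling_step=10.0):
    """Temperature drop (in °C) that would yield the given reliability gain
    if temperature were the only factor (an assumption of this sketch)."""
    return doubling_step * math.log2(reliability_gain)


if __name__ == "__main__":
    # Running 10 °C cooler halves the fault rate and doubles the MTBF.
    print(relative_fault_rate(-10.0), relative_mtbf(-10.0))
    # Temperature drop implied by a given gain under this rule alone.
    print(cooling_needed(2.29))
```

Under this rule alone, a 2.29X reliability gain would correspond to running roughly 12°C cooler. This is only arithmetic on the rule of thumb, not a statement about how the paper arrived at its figure.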
Taisuke Boku (Chair) - University of Tsukuba
Osman Sarood - University of Illinois at Urbana-Champaign
Esteban Meneses - University of Illinois at Urbana-Champaign
Laxmikant Kale - University of Illinois at Urbana-Champaign
The full paper can be found in the ACM Digital Library.