SC13 Presentation

SCHEDULE: NOV 16-22, 2013

Optimization of Cloud Task Processing with Checkpoint-Restart Mechanism

SESSION: Fault Tolerance and Migration in the Cloud


TIME: 3:30PM - 4:00PM


AUTHOR(S): Sheng Di, Yves Robert, Frederic Vivien, Derrick Kondo, Cho-Li Wang, Franck Cappello


In this paper, we optimize fault-tolerance techniques based on a checkpoint/restart mechanism in the context of cloud computing. Our contribution is threefold. (1) We derive a new formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is generic, making no assumption about the failure probability distribution, and is simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing with respect to various costs, such as checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart (BLCR) tool. Task failure events are emulated using a production trace from a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
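
For readers unfamiliar with the baseline, Young's formula is commonly stated as a first-order approximation of the optimal time between checkpoints: the square root of twice the checkpoint cost times the mean time between failures. The sketch below illustrates only that baseline. It is not the authors' new formula, which the abstract does not state. The function names and the sample values (a 2-second checkpoint cost, a 1-hour MTBF, a 1000-second task) are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    interval = sqrt(2 * C * MTBF), with C the cost of one checkpoint and
    MTBF the mean time between failures, both in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(task_length_s: float, interval_s: float) -> int:
    """Number of checkpoints taken inside a task of the given length.

    The task is split into ceil(length / interval) segments; a checkpoint
    is taken between consecutive segments, so none is needed after the last.
    """
    if task_length_s <= 0 or interval_s <= 0:
        raise ValueError("task length and interval must be positive")
    return max(0, math.ceil(task_length_s / interval_s) - 1)


# Hypothetical values, not taken from the paper.
interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)
count = checkpoint_count(task_length_s=1000.0, interval_s=interval)
print(f"interval = {interval:.1f} s, checkpoints = {count}")
```

With these sample values the interval is 120 seconds, so a 1000-second task is split into nine segments with eight checkpoints between them. Young's formula depends on the failure process only through the MTBF, whereas the abstract describes the paper's formula as making no assumption on the failure probability distribution.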

Chair/Author Details:

Henry Tufo (Chair) - University of Colorado

Sheng Di - INRIA and Argonne National Laboratory (USA)

Yves Robert - ENS Lyon and University of Tennessee, Knoxville

Frederic Vivien - INRIA

Derrick Kondo - INRIA

Cho-Li Wang - University of Hong Kong

Franck Cappello - INRIA, Argonne National Laboratory and University of Illinois at Urbana-Champaign

The full paper can be found in the ACM Digital Library