Presentation – PASC 2023


Task-Level Resilience for Dynamically Generated Tasks under Work Stealing in Clusters

Presenter

Description

Permanent hardware failures of cluster nodes cause processes to abort and, if no precautions are taken, all previous compute results will be lost. Resilience can be achieved through checkpointing, which allows restarting applications from a saved state. However, writing checkpoints to a file system is costly and there is a delay before restarting. Therefore, alternative techniques store checkpoints in the main memory of other cluster nodes (in-memory checkpointing), reduce the checkpoint size by selecting the checkpoint data (application-level checkpointing), or recover and continue running on the intact nodes (shrinking localized recovery). These approaches can be nicely combined for Asynchronous Many-Task (AMT) programs, where a runtime system automatically assigns execution units called tasks to processes and threads. Since tasks have clean interfaces, the runtime can automatically select checkpoint data. Moreover, it can reassign tasks that were affected by a failure. Keeping track of tasks is not trivial, though, when tasks may dynamically generate new tasks and work stealing is used to balance the load by moving tasks from busy to idle processes and threads. The talk outlines a task-level checkpointing scheme for this environment. The scheme can handle independent, side-effect-free tasks under multiple failures with a runtime overhead below 1%.
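The ideas named in the description (in-memory checkpoints on another node, work stealing, shrinking recovery) can be illustrated with a small single-process simulation. This is only a toy sketch under strong assumptions and not the scheme presented in the talk: the names `Worker`, `Cluster`, `run_task`, and `fail_at` are hypothetical, failures happen only between steps, and each worker's state is copied to a "buddy" after every task and every steal. That write-through copying sidesteps the bookkeeping problem the talk addresses, and a real runtime would checkpoint far less often.

```python
from collections import deque

def run_task(n):
    """Side-effect-free task: returns (result, dynamically generated children)."""
    if n < 2:
        return n, []
    return 0, [n - 1, n - 2]  # the leaf results sum up to fib(n)

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()  # open tasks owned by this worker
        self.partial = 0      # accumulated result of tasks finished here
        self.held = {}        # owner id -> (open tasks, partial) kept for a buddy

class Cluster:
    def __init__(self, n_workers, root_task):
        self.workers = {i: Worker(i) for i in range(n_workers)}
        self.workers[0].queue.append(root_task)
        for w in self.workers.values():
            self.checkpoint(w)

    def buddy(self, wid):
        # Assumption: the checkpoint lives in the memory of the next worker.
        ids = sorted(self.workers)
        return self.workers[ids[(ids.index(wid) + 1) % len(ids)]]

    def checkpoint(self, w):
        self.buddy(w.wid).held[w.wid] = (list(w.queue), w.partial)

    def step(self, w):
        if not w.queue:  # idle: try to steal from a busy worker
            victims = [v for v in self.workers.values() if len(v.queue) > 1]
            if not victims:
                return
            victim = victims[0]
            w.queue.append(victim.queue.popleft())
            self.checkpoint(victim)  # the victim no longer owns the task
        result, children = run_task(w.queue.pop())
        w.partial += result
        w.queue.extend(children)
        self.checkpoint(w)

    def fail(self, wid):
        if len(self.workers) < 2:
            raise RuntimeError("no surviving worker left")
        heir = self.buddy(wid)
        tasks, partial = heir.held[wid]  # only the saved copy is used
        del self.workers[wid]            # the node and its memory are gone
        heir.queue.extend(tasks)         # shrinking recovery on intact workers
        heir.partial += partial
        for w in self.workers.values():  # buddies changed: rebuild all copies
            w.held.clear()
        for w in self.workers.values():
            self.checkpoint(w)

    def run(self, fail_at=()):
        fail_at = dict(fail_at)  # round number -> id of the worker to kill
        rounds = 0
        while any(w.queue for w in self.workers.values()):
            if rounds in fail_at:
                self.fail(fail_at[rounds])
            for w in list(self.workers.values()):
                self.step(w)
            rounds += 1
        return sum(w.partial for w in self.workers.values())
```

For example, `Cluster(4, 10).run(fail_at={3: 0, 6: 2})` kills two of four workers during the run and still returns the same value as the failure-free `Cluster(4, 10).run()`, because every lost task and partial result is restored from a buddy's copy.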


Time

Monday, June 26, 17:00 - 17:30 CEST

Location

Seehorn

Session

MS2E - Performance in I/O and Fault Tolerance for Scientific Applications

Session Chair

Nicolas Morales

Sandia National Laboratories

Event Type

Minisymposium

Domains

Author

Claudia Fohry

University of Kassel