Task-Level Resilience for Dynamically Generated Tasks under Work Stealing in Clusters
Description
Permanent hardware failures of cluster nodes cause processes to abort and, if no precautions are taken, all previously computed results are lost. Resilience can be achieved through checkpointing, which allows applications to be restarted from a saved state. However, writing checkpoints to a file system is costly, and there is a delay before restarting. Therefore, alternative techniques store checkpoints in the main memory of other cluster nodes (in-memory checkpointing), reduce the checkpoint size by selecting the checkpoint data (application-level checkpointing), or recover and continue running on the intact nodes (shrinking localized recovery). These approaches combine well in Asynchronous Many-Task (AMT) programs, where a runtime system automatically assigns execution units called tasks to processes and threads. Since tasks have clean interfaces, the runtime can automatically select checkpoint data. Moreover, it can reassign tasks that were affected by a failure. Keeping track of tasks is not trivial, though, when tasks may dynamically generate new tasks and work stealing is used to balance the load by moving tasks from busy to idle processes and threads. The talk outlines a task-level checkpointing scheme for this environment. The scheme can handle independent, side-effect-free tasks under multiple failures with a runtime overhead below 1%.
Time
Monday, June 26, 17:00 - 17:30 CEST
Event Type
Computer Science, Machine Learning, and Applied Mathematics