Scalable GPU-Accelerated Incremental Checkpointing of Sparsely Updated Data
Description
Checkpointing large amounts of related data concurrently to stable storage is a common I/O pattern in many HPC applications. It arises in scenarios such as checkpoint-restart fault tolerance, coupled workflows that combine simulations with analytics, and adjoint computations. The pattern is challenging because checkpoints must be taken frequently, which typically leads to I/O bottlenecks that degrade application performance and scalability. A large class of applications, including graph algorithms, performs only sparse updates between checkpoints. Incremental checkpointing approaches that save only the differences from one checkpoint to the next can dramatically reduce I/O bottlenecks and storage utilization. However, such techniques are not without challenges: it is non-trivial to determine transparently what data changed since the previous checkpoint, and to assemble the differences in a compact form that does not incur excessive metadata. This talk discusses the challenge of making efficient incremental checkpoints on GPU-accelerated platforms and introduces an innovative approach that builds a compact representation of the differences between checkpoints using Merkle-tree-inspired data structures for parallel data construction and manipulation. We assess the effectiveness of our approach with ORANGES, a graph alignment application with sparse update patterns.
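To make the idea of hash-tree-based difference detection concrete, the following is a minimal, sequential CPU sketch in Python. It is not the approach presented in the talk, which targets GPUs and builds its structures in parallel; the chunk size, the function names (`build_tree`, `changed_chunks`, `diff_checkpoint`, `restore`), and the choice of SHA-256 are all assumptions made for illustration. It also assumes that both checkpoints are non-empty and have the same length.

```python
import hashlib

CHUNK = 4  # bytes per chunk; tiny on purpose, an illustrative assumption


def build_tree(data: bytes, chunk: int = CHUNK) -> list[list[bytes]]:
    """Return hash levels: levels[0] holds chunk hashes, levels[-1] the root."""
    leaves = [hashlib.sha256(data[i:i + chunk]).digest()
              for i in range(0, len(data), chunk)]
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Each parent hashes its two children (or one, at an odd tail).
        parents = [hashlib.sha256(b"".join(prev[i:i + 2])).digest()
                   for i in range(0, len(prev), 2)]
        levels.append(parents)
    return levels


def changed_chunks(old: list[list[bytes]], new: list[list[bytes]]) -> list[int]:
    """Walk top-down, descending only into subtrees whose hashes differ."""
    top = len(new) - 1
    frontier = [0] if old[top][0] != new[top][0] else []
    for level in range(top - 1, -1, -1):
        next_frontier = []
        for node in frontier:
            for child in (2 * node, 2 * node + 1):
                if child < len(new[level]) and old[level][child] != new[level][child]:
                    next_frontier.append(child)
        frontier = next_frontier
    return frontier  # indices of modified chunks


def diff_checkpoint(prev: bytes, curr: bytes) -> dict[int, bytes]:
    """Store only the chunks that differ from the previous checkpoint."""
    if len(prev) != len(curr) or not curr:
        raise ValueError("sketch assumes equal-length, non-empty checkpoints")
    changed = changed_chunks(build_tree(prev), build_tree(curr))
    return {i: curr[i * CHUNK:(i + 1) * CHUNK] for i in changed}


def restore(prev: bytes, diff: dict[int, bytes]) -> bytes:
    """Rebuild the current state from the previous checkpoint plus the diff."""
    buf = bytearray(prev)
    for i, chunk in diff.items():
        buf[i * CHUNK:i * CHUNK + len(chunk)] = chunk
    return bytes(buf)
```

With a sparse update, the top-down walk skips every unchanged subtree, so the work and the stored diff both scale with the number of modified chunks rather than with the full checkpoint size. A real implementation would also keep the previous tree in memory instead of rebuilding it from the old data.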
Time
Monday, June 26, 17:30 - 18:00 CEST
Event Type
Computer Science, Machine Learning, and Applied Mathematics