Exploiting the Overlapping Challenges of Asynchronous Many Task Runtimes and Resilience for Integrated Control-Flow and Data Resiliency
Description
As computing clusters grow in scale, so do the competing demands of resilience and performance. Contemporary high-performance runtimes accommodate heterogeneous hardware, performance variability, and dynamic workload distributions; these challenges require gathering application control-flow and data-flow characteristics to dynamically manage and relocate tasks and data. Contemporary resilience runtimes accommodate lost data (soft failures) and lost execution contexts (hard failures); these challenges require similar knowledge of application control-flow and data-flow characteristics to dynamically rebuild and relocate tasks and data. These competing interests have largely converged on the application information and constraints they require. We exploit this convergence to extend Kokkos Resilience, an automated checkpoint/recovery library, and Darma VT, an Asynchronous Many-Task (AMT) runtime, so that they share information and enable high-performance, highly resilient applications with little change to VT applications. With our updates, Kokkos Resilience can track dynamic task usage, traverse implicit application data dependencies, and uniquely identify and serialize arbitrary user-defined data elements. Further, VT provides the capability to initiate remote-data checkpoint/recovery, define partial control-flow boundaries for checkpointing individual data collections, and rebalance workloads after node failure and relaunch.
SNL is managed and operated by NTESS under DOE NNSA contract DE-NA0003525.
Time
Monday, June 26, 16:30 - 17:00 CEST
Event Type
Computer Science, Machine Learning, and Applied Mathematics