BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230831T095745Z
LOCATION:Seehorn
DTSTART;TZID=Europe/Stockholm:20230626T163000
DTEND;TZID=Europe/Stockholm:20230626T170000
UID:submissions.pasc-conference.org_PASC23_sess161_msa262@linklings.com
SUMMARY:Exploiting the Overlapping Challenges of Asynchronous Many Task Ru
 ntimes and Resilience for Integrated Control-Flow and Data Resiliency
DESCRIPTION:Minisymposium\n\nMatthew Whitlock and Nic Morales (Sandia Nati
 onal Laboratories) and Keita Teranishi (Oak Ridge National Laboratory)\n\n
 As computing clusters grow in scale, so do the competing demands of resili
 ence and performance. Contemporary high-performance runtimes accommodate h
 eterogeneous hardware, performance variability, and dynamic workload d
 istributions; these challenges require gathering application control-flow 
 and data-flow characteristics to dynamically manage and relocate tasks and
  data. Contemporary resilience runtimes accommodate lost data (soft failur
 es) and lost execution contexts (hard failures); these challenges requ
 ire similar knowledge of application control-flow and data-flow characteri
 stics to dynamically rebuild and relocate tasks and data. These competing 
 interests have largely converged on the application information and constr
 aints they require. We exploit this to extend Kokkos Resilience – an autom
 ated checkpoint/recovery library – and Darma VT – an Asynchronous Many-Tas
 k (AMT) runtime – to share information and enable high-performance, highly
  resilient applications with little change to VT applications. With our up
 dates, Kokkos Resilience is able to track dynamic task usage, traverse imp
 licit application data dependencies, and uniquely identify and serialize a
 rbitrary user-defined data elements. Further, VT provides the capability t
 o initiate remote-data checkpoint/recovery, define partial control-flow bo
 undaries for checkpointing individual data collections, and rebalance work
 loads after node failure and relaunch. SNL is managed and operated by NTES
 S under DOE NNSA contract DE-NA0003525.\n\nDomain: Computer Science, Machi
 ne Learning, and Applied Mathematics\n\nSession Chair: Nicolas Morales (S
 andia National Laboratories)
END:VEVENT
END:VCALENDAR
