BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230831T095746Z
LOCATION:Flüela
DTSTART;TZID=Europe/Stockholm:20230627T163000
DTEND;TZID=Europe/Stockholm:20230627T170000
UID:submissions.pasc-conference.org_PASC23_sess129_msa152@linklings.com
SUMMARY:Scaling Out Machine Learning Convergence Training on LUMI
DESCRIPTION:Minisymposium\n\nDiana Moise (HPE) and Samuel Antao (AMD)\n\nM
 L workloads are becoming a major FLOP consumer in many HPC centers in Europ
 e and around the world. These workloads leverage well-established framewor
 ks such as PyTorch and TensorFlow, which also act as a portability layer, m
 aking ML one of the first production-ready workloads on new systems usin
 g AMD Instinct GPUs. The process is not without challenges, though, as man
 y factors must be considered at scale, including node and network topolog
 y, kernel-launch latency, I/O intensity, and profiling that spans Pytho
 n and C/C++ runtime libraries. A challenge unique to ML training is tha
 t these performance optimizations can have a lesser or greater impact o
 n the quality (accuracy metric) of the trained model. This presentatio
 n will go over these challenges and the approaches found to address the
 m. The talk will cover experiences and best practices collected from sc
 aling out ML training on GPU systems at LUMI and other European HPC cen
 ters.\n\nDomain: Computer Science, Machine Learning, and Applied Mathema
 tics\n\nSession Chair: Aniello Esposito (HPE)
END:VEVENT
END:VCALENDAR
