Presentation – PASC 2023

Scaling Out Machine Learning Convergence Training on LUMI

Presenters

Description: ML workloads are becoming a major consumer of FLOPs at many HPC centers in Europe and around the world. These workloads rely on well-established frameworks such as PyTorch and TensorFlow, which also act as a portability layer, making ML one of the first production-ready workloads on new systems using AMD Instinct GPUs. The process is not without challenges, though: many factors must be considered at scale, including node and network topology, kernel launch latency, I/O intensity, and profiling that spans Python and C/C++ runtime libraries. A challenge unique to training ML workloads is that these performance optimizations can have a smaller or larger impact on the quality (accuracy metric) of the trained model. This presentation will go over these challenges and the approaches found to address them. The talk will cover experiences and best practices collected from scaling out ML training on GPU systems at LUMI and other European HPC centers.

Time: Tuesday, June 27, 16:30 - 17:00 CEST

Location: Flüela

Session: MS4B - Porting of Applications to AMD GPUs: Lessons Learned

Session Chair: Aniello Esposito (HPE)

Event Type: Minisymposium

Domains

Authors