Scaling Out Machine Learning Convergence Training on LUMI
Description
ML workloads are becoming a major FLOP consumer in many HPC centers in Europe and around the world. These workloads leverage well-established frameworks such as PyTorch and TensorFlow, which also act as a portability layer, making ML one of the first production-ready workloads on new systems using AMD Instinct GPUs. The process is not without challenges, though, as many factors must be considered at scale, including node and network topology, kernel-launch latency, I/O intensity, and profiling that spans Python and C/C++ runtime libraries. A challenge unique to training ML workloads is that these performance optimizations can have a smaller or larger impact on the quality (accuracy metric) of the trained model. This presentation will go over these challenges and the approaches found to address them. The talk will cover experiences and best practices collected from scaling out ML training on GPU systems at LUMI and other European HPC centers.
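As a small illustration of the synchronous data-parallel pattern behind scaled-out training (not taken from the presentation itself; the toy model, data, and function names below are hypothetical), each worker computes a gradient on its own data shard and an all-reduce averages the gradients before every update:

```python
# Sketch of synchronous data-parallel SGD: each worker holds a data
# shard, computes a local gradient, and an all-reduce (here a plain
# Python mean, standing in for e.g. an RCCL/NCCL collective) combines
# them. Toy problem: fit w in y = w * x by minimizing mean squared error.

def local_gradient(w, shard):
    # Gradient of mean((w*x - y)^2) over one worker's shard.
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    # Stand-in for an all-reduce collective across GPUs/nodes.
    return sum(grads) / len(grads)

def step(w, shards, lr):
    grads = [local_gradient(w, s) for s in shards]
    return w - lr * allreduce_mean(grads)

# Two "workers", data generated from y = 3*x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = step(w, shards, lr=0.02)
# w converges toward 3.0
```

The sketch also hints at the accuracy trade-off the abstract mentions: changing the number of workers changes the effective batch size per update, so scaling out can alter convergence behavior, not just throughput.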
Time
Tuesday, June 27, 16:30 - 17:00 CEST
Event Type
Computer Science, Machine Learning, and Applied Mathematics