Presentation
P41 - MPI for Multi-Core, Multi Socket, and GPU Architectures: Optimised Shared Memory Allreduce
Presenter
Description
The benefits of shared memory collectives, especially allreduce, have been demonstrated in the literature. This intra-node communication is not only necessary for single-node communication but is also a key component of more complex inter-node communication algorithms [1]. In contrast to [2], our use of shared memory is invisible to the user of the library: the send and receive buffers are not required to reside in shared memory. Instead, the data from the send buffer is copied into the shared memory segment in parallel chunks, which requires the reduction operations to be commutative. Subsequently, the data is further reduced within the shared memory segment using a tree-based algorithm, and the final result is copied to the receive buffer. During this process the reduction operations and synchronization barriers are combined, and the algorithm is adapted based on performance measurements.
[1] Jocksch, A., Ohana, N., Lanti, E., Koutsaniti, E., Karakasis, V., Villard, L.: An optimisation of allreduce communication in message-passing systems. Parallel Computing 107, 102812 (2021)
[2] Li, S., Hoefler, T., Hu, C., Snir, M.: Improved MPI collectives for MPI processes in shared address spaces. Cluster Computing 17(4), 1139–1155 (2014)
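The sketch below illustrates the kind of intra-node shared-memory allreduce outlined in the description. It assumes a node-local communicator, MPI_SUM on doubles, and MPI-3 shared-memory windows (MPI_Win_allocate_shared); the function name shmem_allreduce_sum and the fixed binary-tree structure are illustrative assumptions, not the authors' optimised implementation, which additionally fuses reductions with synchronization and adapts the algorithm from performance measurements.

#include <mpi.h>
#include <string.h>

/* Sketch: intra-node allreduce (MPI_SUM on doubles) through an MPI
 * shared-memory window. Each rank copies its send buffer into the shared
 * segment in parallel, a binary tree reduces the copies inside the segment,
 * and every rank copies the final result to its receive buffer. */
static void shmem_allreduce_sum(const double *sendbuf, double *recvbuf,
                                int count, MPI_Comm node_comm)
{
    int rank, size;
    MPI_Comm_rank(node_comm, &rank);
    MPI_Comm_size(node_comm, &size);

    /* Rank 0 allocates one segment large enough for all contributions;
     * the other ranks attach with a zero-size allocation. */
    MPI_Win win;
    double *shared = NULL;
    MPI_Aint bytes = (rank == 0) ? (MPI_Aint)count * size * sizeof(double) : 0;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                            node_comm, &shared, &win);

    /* Every rank obtains the base address of rank 0's segment. */
    MPI_Aint qbytes;
    int disp_unit;
    MPI_Win_shared_query(win, 0, &qbytes, &disp_unit, &shared);

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    /* Step 1: parallel copy of the send buffers into the shared segment. */
    memcpy(shared + (size_t)rank * count, sendbuf,
           (size_t)count * sizeof(double));
    MPI_Win_sync(win);            /* make local stores visible */
    MPI_Barrier(node_comm);
    MPI_Win_sync(win);

    /* Step 2: tree-based reduction inside the shared segment. */
    for (int stride = 1; stride < size; stride *= 2) {
        if (rank % (2 * stride) == 0 && rank + stride < size) {
            double *dst = shared + (size_t)rank * count;
            const double *src = shared + (size_t)(rank + stride) * count;
            for (int i = 0; i < count; ++i)
                dst[i] += src[i];                 /* commutative operation */
        }
        MPI_Win_sync(win);
        MPI_Barrier(node_comm);   /* one barrier per tree level */
        MPI_Win_sync(win);
    }

    /* Step 3: every rank copies the reduced result to its receive buffer. */
    memcpy(recvbuf, shared, (size_t)count * sizeof(double));

    MPI_Win_unlock_all(win);
    MPI_Barrier(node_comm);       /* keep the window alive until all are done */
    MPI_Win_free(&win);
}

The node-local communicator can be obtained with MPI_Comm_split_type using MPI_COMM_TYPE_SHARED; a production implementation would additionally split large buffers into chunks so that all ranks contribute to every reduction step instead of idling at the upper tree levels.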
Time
Tuesday, June 27, 10:12 - 10:13 CEST
Location
Davos
Event Type
Poster