P41 - MPI for Multi-Core, Multi Socket, and GPU Architectures: Optimised Shared Memory Allreduce
DescriptionIn the literature the benefits of shared memory collectives especially allreduce have been shown. This intra-node communication is not only necessary for single node communications but it is also a key component of more complex inter-node communication algorithms [1]. In contrast to [2], our implementation of shared memory usage is invisible to the user of the library, the data of the send and receive buffers is not required to reside in shared memory already but the data from the send buffer is copied into the shared memory segment in parallel chunks where commutative reduction operations are necessary. Subsequently, the data is further reduced within the shared memory segment using a tree-based algorithm. The final result is then copied to the receive buffer. The reduction operations and synchronization barriers are combined during this process, and the algorithm is adapted, depending on performance measurements.
[1] Jocksch, A., Ohana, N., Lanti, E., Koutsaniti, E., Karakasis, V., Villard, L.: An optimisation of allreduce communication in message-passing systems. Parallel Computing 107, 102812 (2021)
[2] Li, S., Hoefler, T., Hu, C., Snir, M.: Improved MPI collectives for MPI processes in shared address spaces. Cluster computing 17(4), 1139–1155 (2014)
TimeTuesday, June 2719:30 - 21:30 CEST
Event Type