BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230831T095746Z
LOCATION:Sertig
DTSTART;TZID=Europe/Stockholm:20230627T143000
DTEND;TZID=Europe/Stockholm:20230627T150000
UID:submissions.pasc-conference.org_PASC23_sess182_pap116@linklings.com
SUMMARY:Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Wat
 er Model on Unstructured Meshes
DESCRIPTION:Paper\n\nJennifer Faj and Tobias Kenter (Paderborn University)
 , Sara Faghih-Naini (University of Bayreuth), Christian Plessl (Paderborn 
 University), and Vadym Aizinger (University of Bayreuth)\n\nFPGAs are at
 tracting interest as energy-efficient accelerators for scientific simul
 ations
 , including for methods operating on unstructured meshes. Considering the 
 potential impact on high-performance computing, specific attention needs t
 o be given to the scalability of such approaches. In this context, the net
 working capabilities of FPGA hardware and software stacks can play a cr
 ucial role in enabling solutions that go beyond a traditional host-MPI
  and acce
 lerator-offload model. In this work, we present the multi-FPGA scaling of 
 a discontinuous Galerkin shallow water model using direct low-latency stre
 aming communication between the FPGAs. To this end, the unstructured mesh 
 defining the spatial domain of the simulation is partitioned, the inter-FP
 GA network is configured to match the topology of neighboring partitions, 
 and halo communication is overlapped with the dataflow computation pipelin
 e. With this approach, we demonstrate strong scaling on up to eight FPGAs 
 with a parallel efficiency of >80% and execution times per time step of as
  low as 7.6 µs. At the same time, with weak scaling, the approach makes
  it possible to simulate larger meshes that would exceed the local memo
 ry limits of a single FPGA, now supporting meshes of more than 100,000
  elements and reac
 hing an aggregated performance of up to 6.5 TFLOP/s. Finally, a hierarchica
 l partitioning approach allows for better utilization of the FPGA compute 
 resources in some designs and, by mitigating limitations posed by the comm
 unication topology, enables simulations with up to 32 partitions on 8 FPGA
 s.\n\nDomain: Chemistry and Materials, Climate, Weather and Earth Sciences
 , Computer Science, Machine Learning, and Applied Mathematics\n\nSession
  Chair: Mauro Bianco (ETH Zurich / CSCS)
END:VEVENT
END:VCALENDAR
