# Papers

The Proceedings of the PASC Conference are published in the Association for Computing Machinery’s (ACM’s) Digital Library. In recognition of the high quality of the PASC Conference papers track, the ACM continues to provide the proceedings as an Open Table of Contents (OpenTOC). This means that the definitive versions of PASC Conference papers are available to everyone at no charge to the author and without any paywall constraints for readers.

The OpenTOC for the PASC Conference is hosted on the ACM’s SIGHPC website. PASC papers can be accessed for free at: www.sighpc.org/for-our-community/acm-open-tocs.

The following papers will be presented as talks at PASC23, and will be accessible on the OpenTOC library post-conference.

## AI Super-Resolution Subfilter Modeling for Multi-Physics Flows

Many complex simulations are extremely expensive and barely feasible, if at all, even on current supercomputers. A typical reason for this is coupled length and time scales in the application, which need to be resolved simultaneously. As a result, many simulation approaches rely on scale-splitting, where only the larger scales are simulated, while the small scales are modeled with subfilter models. This work presents a novel subfilter modeling approach based on AI super-resolution. A physics-informed enhanced super-resolution generative adversarial network (PIESRGAN) is used to accurately close subfilter terms in the solved transport equations. It is demonstrated how a simulation design with the PIESRGAN approach can be used to accelerate complex simulations on current supercomputers, using the example of three fluid dynamics simulation setups with complex features on the supercomputer environment JURECA-DC/JUWELS (Booster). Further advantages and shortcomings of the PIESRGAN approach are discussed.

Author(s): Mathis Bode (Forschungszentrum Jülich)

Domain: Engineering

## Approximation and Optimization of Global Environmental Simulations with Neural Networks

Solving a system of hundreds of chemical differential equations in environmental simulations has a major computational complexity, and thereby requires high performance computing resources, which is a challenge as the spatio-temporal resolution increases. Machine learning methods, and especially deep learning, can offer an approximation of simulations with some factor of speed-up while using less compute resources. In this work, we introduce a neural-network-based approach (ICONET) to forecast trace gas concentrations without executing the traditional compute-intensive atmospheric simulations. ICONET is equipped with a multifeature Long Short Term Memory (LSTM) model to forecast atmospheric chemicals iteratively in time. We generated the training and test dataset, our target dataset for ICONET, by execution of an atmospheric chemistry simulation in ICON-ART. Applying the ICONET trained model to forecast a test dataset results in a good fit of the forecast values to our target dataset. We discussed appropriate metrics to evaluate the quality of models and presented the quality of the ICONET forecasts with RMSE and KGE metrics. The variety in the nature of trace gases limits the model’s learning and forecast skills according to the respective trace gas. In addition to the quality of the ICONET forecasts, we described the computational efficiency of ICONET as its run time speed-up in comparison to the run time of the ICON-ART simulation. The ICONET forecast showed a speed-up factor of 3.1 over the run time of the atmospheric chemistry simulation of ICON-ART, which is a significant achievement, especially when considering the importance of ensemble simulation.
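For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of RMSE and of the Kling-Gupta efficiency (KGE) in its commonly cited 2009 form. It is an illustration only: the function names and the toy arrays are hypothetical, and the abstract does not say which KGE variant the authors used.

```python
import numpy as np

def rmse(forecast, target):
    """Root-mean-square error; 0.0 means a perfect match."""
    forecast = np.asarray(forecast, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((forecast - target) ** 2)))

def kge(forecast, target):
    """Kling-Gupta efficiency (2009 form, assumed here); 1.0 is a perfect match."""
    forecast = np.asarray(forecast, dtype=float)
    target = np.asarray(target, dtype=float)
    r = np.corrcoef(forecast, target)[0, 1]   # linear correlation
    alpha = forecast.std() / target.std()     # variability ratio
    beta = forecast.mean() / target.mean()    # bias ratio
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

# Toy data (hypothetical): a perfect forecast and one with a constant offset.
target = np.array([1.0, 2.0, 3.0, 4.0])
perfect = target.copy()
biased = target + 1.0
```

A constant offset leaves the correlation and variability terms at 1, so only the bias ratio lowers the KGE, while RMSE reports the offset directly.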

Author(s): Elnaz Azmi (Karlsruhe Institute of Technology), Jörg Meyer (Karlsruhe Institute of Technology), Marcus Strobl (Karlsruhe Institute of Technology), Michael Weimer (Massachusetts Institute of Technology), and Achim Streit (Karlsruhe Institute of Technology)

Domain: Climate, Weather and Earth Sciences

## Causal Discovery and Optimal Experimental Design for Genome-Scale Biological Network Recovery

Causal discovery of genome-scale networks is important for identifying pathways from genes to observable traits, e.g., differences in cell function, disease, drug resistance, and others. Causal learners based on graphical models rely on interventional samples to orient edges in the network. However, these models have not been shown to scale up to the size of the genome, which is on the order of 1e3-1e4 genes. We introduce a new learner, SP-GIES, that jointly learns from interventional and observational datasets and achieves almost 4x speedup against an existing learner for 1,000 node networks. SP-GIES achieves an AUC-PR score of 0.91 on 1,000 node networks, and scales up to 2,000 node networks; this is 4x larger than existing works. We also show how SP-GIES improves downstream optimal experimental design strategies for selecting interventional experiments to perform on the system. This is an important step forward in realizing causal discovery at scale via autonomous experimental design.

Author(s): Ashka Shah (University of Chicago, Argonne National Laboratory), Arvind Ramanathan (Argonne National Laboratory, University of Chicago), Valerie Hayot-Sasson (University of Chicago, Argonne National Laboratory), and Rick Stevens (University of Chicago, Argonne National Laboratory)

Domain: Life Sciences

## Cornerstone: Octree Construction Algorithms for Scalable Particle Simulations

This paper presents an octree construction method, called Cornerstone, that facilitates global domain decomposition and interactions between particles in mesh-free numerical simulations. Our method is based on algorithms developed for 3D computer graphics, which we extend to distributed high performance computing (HPC) systems. Cornerstone yields global and locally essential octrees and is able to operate on all levels of tree hierarchies in parallel. The resulting octrees are suitable for supporting the computation of various kinds of short and long range interactions in N-body methods, such as Barnes-Hut and the Fast Multipole Method (FMM). While we provide a CPU implementation, Cornerstone may run entirely on GPUs. This results in significantly faster tree construction compared to execution on CPUs and serves as a powerful building block for the design of simulation codes that move beyond an offloading approach, where only numerically intensive tasks are dispatched to GPUs. With data residing exclusively in GPU memory, Cornerstone eliminates data movements between CPUs and GPUs. As an example, we employ Cornerstone to generate locally essential octrees for a Barnes-Hut treecode running on almost the full LUMI-G system with up to 8 trillion particles.

Author(s): Sebastian Keller (ETH Zurich / CSCS), Aurélien Cavelan (University of Basel), Rubén Cabezon (University of Basel), Lucio Mayer (University of Zurich), and Florina Ciorba (University of Basel)

Domain: Physics

## Data-Driven Whole-Genome Clustering to Detect Geospatial, Temporal, and Functional Trends in SARS-CoV-2 Evolution

Current methods for defining SARS-CoV-2 lineages ignore the vast majority of the SARS-CoV-2 genome. We develop and apply an exhaustive vector comparison that directly compares all known SARS-CoV-2 genome sequences to produce novel lineage classifications. We utilize data-driven models that (i) accurately capture the complex interactions across the set of all known SARS-CoV-2 genomes, (ii) scale to leadership-class computing systems, and (iii) enable tracking how such strains evolve geospatially over time. Analyses of this kind may produce actionable insights and transform our ability to prepare for and respond to current and future biological threats.

Author(s): Jean Merlet (University of Tennessee), John Lagergren (Oak Ridge National Laboratory), Verónica Melesse Vergara (Oak Ridge National Laboratory), Mikaela Cashman (Lawrence Berkeley National Laboratory), Christopher Bradburne (Johns Hopkins Whiting School of Engineering), Raina Plowright (Cornell University), Emily Gurley (Johns Hopkins Bloomberg School of Public Health), Wayne Joubert (Oak Ridge National Laboratory), and Daniel Jacobson (Oak Ridge National Laboratory)

Domain: Life Sciences

## Exploiting Symmetries for Preconditioning Poisson’s Equation in CFD Simulations

Divergence constraints are present in the governing equations of many physical phenomena, and they usually lead to a Poisson equation whose solution is one of the most challenging parts of scientific simulation codes. Indeed, it is the main bottleneck of incompressible Computational Fluid Dynamics (CFD) simulations, and developing efficient and scalable Poisson solvers is a critical task. This work presents an enhanced variant of the Factored Sparse Approximate Inverse (FSAI) preconditioner. It arises from exploiting *s* spatial reflection symmetries, which are often present in academic and industrial configurations and allow transforming Poisson’s equation into a set of 2^*s* fully-decoupled subsystems. Then, we introduce another level of approximation by taking advantage of the subsystems’ close similarity and applying the same FSAI to all of them. This leads to substantial memory savings and notable increases in the arithmetic intensity resulting from employing the more compute-intensive sparse matrix-matrix product. Of course, recycling the same preconditioner on all the subsystems worsens its convergence. However, this effect was much smaller than expected and made us introduce relatively cheap but very effective low-rank corrections. A key feature of these corrections is that thanks to being applied to each subsystem independently, the more symmetries being exploited, the more effective they become, leading to up to 5.7x faster convergences than the standard FSAI. Numerical experiments on up to 1.07 billion grids confirm the quality of our low-rank corrected FSAI, which, despite being 2.6x lighter, outperforms the standard FSAI by a factor of up to 4.4x.
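The decoupling step described in the abstract can be illustrated on a toy problem. The sketch below uses a single reflection symmetry (*s* = 1) on a 1D Poisson matrix, so the system splits into 2^1 = 2 independent half-size subsystems. It is a minimal illustration under stated assumptions, not the authors' method: it uses dense direct solves instead of an FSAI-preconditioned iterative solver, and the function names are hypothetical.

```python
import numpy as np

def poisson_1d(n):
    """Standard 1D Poisson matrix tridiag(-1, 2, -1) with Dirichlet ends."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def solve_with_reflection(A, b):
    """Solve A x = b by exploiting one mirror symmetry (n must be even)."""
    n = A.shape[0]
    m = n // 2
    # Mirrored ordering: first half as is, second half listed from the far end,
    # so that unknown j and unknown m + j are mirror images of each other.
    perm = np.concatenate([np.arange(m), np.arange(n - 1, m - 1, -1)])
    Ap = A[np.ix_(perm, perm)]
    bp = b[perm]
    A1, A2 = Ap[:m, :m], Ap[:m, m:]
    # With this ordering the matrix has the block form [[A1, A2], [A2, A1]].
    assert np.allclose(Ap[m:, m:], A1) and np.allclose(Ap[m:, :m], A2)
    b1, b2 = bp[:m], bp[m:]
    # Two fully decoupled half-size subsystems.
    x_sum = np.linalg.solve(A1 + A2, b1 + b2)    # equals x1 + x2
    x_diff = np.linalg.solve(A1 - A2, b1 - b2)   # equals x1 - x2
    xp = np.concatenate([(x_sum + x_diff) / 2.0, (x_sum - x_diff) / 2.0])
    x = np.empty(n)
    x[perm] = xp                                 # undo the mirrored ordering
    return x

A = poisson_1d(8)
b = np.arange(1.0, 9.0)
x = solve_with_reflection(A, b)
```

The two subsystem matrices `A1 + A2` and `A1 - A2` differ only in the entries that couple the two halves, which is the kind of close similarity the abstract exploits when it reuses one preconditioner for all subsystems.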

Author(s): Àdel Alsalti-Baldellou (Polytechnic University of Catalonia, Termo Fluids SL), Carlo Janna (University of Padova, M3E srl), Xavier Álvarez-Farré (SURF), and F. Xavier Trias (Polytechnic University of Catalonia)

Domain: Engineering

**ACM Papers – PASC23 Plenary Paper**

## FourCastNet: Accelerating Global High-Resolution Weather Forecasting Using Adaptive Fourier Neural Operators

Extreme weather amplified by climate change is causing increasingly devastating impacts across the globe. The current use of physics-based numerical weather prediction (NWP) limits accuracy due to high computational cost and strict time-to-solution limits. We report that a data-driven deep learning Earth system emulator, FourCastNet, can predict global weather and generate medium-range forecasts five orders-of-magnitude faster than NWP while approaching state-of-the-art accuracy. FourCastNet is optimized and scales efficiently on three supercomputing systems: Selene, Perlmutter, and JUWELS Booster up to 3,808 NVIDIA A100 GPUs, attaining 140.8 Petaflops in mixed precision (11.9% of peak at that scale). The time-to-solution for training FourCastNet measured on JUWELS Booster on 3,072 GPUs is 67.4 minutes, resulting in an 80,000 times faster time-to-solution relative to state-of-the-art NWP, in inference. FourCastNet produces accurate instantaneous weather predictions for a week in advance and enables enormous ensembles that could be used to improve predictions of rare weather extremes.

Author(s): Thorsten Kurth (NVIDIA Inc.), Shashank Subramanian (Lawrence Berkeley National Laboratory), Peter Harrington (Lawrence Berkeley National Laboratory), Jaideep Pathak (NVIDIA Inc.), Morteza Mardani (NVIDIA Inc.), David Hall (NVIDIA Inc.), Andrea Miele (NVIDIA Inc.), Karthik Kashinath (NVIDIA Inc.), and Anima Anandkumar (NVIDIA Inc.)

Domain: Climate, Weather and Earth Sciences

## FPGA Acceleration for HPC Supercapacitor Simulations

In the search for more energy-efficient computing devices that could be assembled to build future exascale systems, this study proposes a chip-to-chip comparison between a CPU, a GPU and an FPGA, as well as a scalability study on multiple FPGAs from two of the available vendors. The application considered here has been extracted from a production code in material science. This allows for the benchmarking of different implementations to be performed on a production test case and not just theoretical ones. The core algorithm is a matrix-free conjugate gradient that computes the total electrostatic energy thanks to an Ewald summation at each iteration. This paper depicts the original MPI implementation of the application, details a numerical accuracy study and explains the methodology followed as well as the resulting FPGA implementation based on MaxCompiler. The FPGA implementation using a 40-bit floating-point number representation outperforms the CPU implementation both in terms of computing power and energy usage, resulting in an energy efficiency more than 25 times better. Compared to the GPU of the same generation, the FPGA reaches 60% of the GPU performance while the ratio of the performance per watt is still better by a factor of 2. Thanks to its low average power usage, the FPGA bests both fully loaded CPU and GPU in terms of number of conjugate gradient iterations per second and per watt. Finally, an implementation using OneAPI is described as well, showcasing a new development environment for FPGA in HPC.

Author(s): Charles Prouveur (CNRS), Matthieu Haefele (CNRS), Tobias Kenter (Paderborn University), and Nils Voss (Imperial College London)

Domain: Chemistry and Materials

## Graph Contractions for Calculating Correlation Functions in Lattice QCD

Computing correlation functions for many-particle systems in Lattice QCD is vital to extract nuclear physics observables like the energy spectrum of hadrons such as protons. However, this type of calculation has long been considered to be very challenging because of the complex nature of a hadron composed of quarks with many degrees of freedom. In particular, a correlation function can be calculated through a sum of all possible pairs of quark contractions dictated by *Wick*’s theorem. Because the number of terms of this sum can be very large for any hadronic system of interest, fast evaluation of the sum faces several challenges: an extremely large number of contractions, a huge memory footprint, and the speed of contractions. In this paper, we present a Lattice QCD analysis software suite, *Redstar*, which addresses these challenges by utilizing novel algorithmic and software engineering methods targeting modern computing platforms such as many-core CPUs and GPUs. In particular, *Redstar* represents every term in the sum of a correlation function by a graph and applies efficient graph algorithms to reduce the number of contractions, to lower the cost of the computations, and to minimize the total memory footprint. Moreover, *Redstar* carries out the contractions on either CPUs or GPUs utilizing an internal and highly efficient *Hadron* contraction library. Specifically, we illustrate some important algorithmic optimizations of *Redstar*, show key design features of the *Hadron* library, and present the speedup values due to the optimizations along with performance figures for calculating six correlation functions on four computing platforms.

Author(s): Jie Chen (Jefferson Lab), Robert Edwards (Jefferson Lab), and Weizhen Mao (William & Mary)

Domain: Physics

## Hardware-Agnostic Interactive Exascale In Situ Visualization of Particle-In-Cell Simulations

The volume of data generated by exascale simulations requires scalable tools for analysis and visualization. Due to the relatively low I/O bandwidth of modern HPC systems, it is crucial to work as close as possible with simulated data via in situ approaches. In situ visualization provides insights into simulation data and, with the help of additional interactive analysis tools, can support the scientific discovery process at an early stage. Such in situ visualization tools need to be hardware-independent given the ever-increasing hardware diversity of modern supercomputers. We present a new in situ 3D vector field visualization algorithm for particle-in-cell (PIC) simulations and performance evaluation of the solution developed at large-scale. We create a solution in a hardware-agnostic approach to support high throughput and interactive in situ processing on leadership class computing systems. To that end, we demonstrate performance portability on Summit’s and the Frontier’s pre-exascale testbed at the Oak Ridge Leadership Computing Facility.

Author(s): Felix Meyer (Helmholtz-Zentrum Dresden-Rossendorf, TU Dresden), Benjamin Hernandez (Oak Ridge National Laboratory), Richard Pausch (Helmholtz-Zentrum Dresden-Rossendorf), Rene Widera (Helmholtz-Zentrum Dresden-Rossendorf), David Groß (TU Dresden), Sergei Bastrakov (Helmholtz-Zentrum Dresden-Rossendorf), Axel Huebl (Lawrence Berkeley National Laboratory), Guido Juckeland (Helmholtz-Zentrum Dresden-Rossendorf), Jeffrey Kelling (Helmholtz-Zentrum Dresden-Rossendorf), Matt Leinhauser (University of Delaware, CASUS), David Rogers (Oak Ridge National Laboratory), Ulrich Schramm (Helmholtz-Zentrum Dresden-Rossendorf, TU Dresden), Klaus Steiniger (Helmholtz-Zentrum Dresden-Rossendorf), Stefan Gumhold (TU Dresden), Jeff Young (Georgia Institute of Technology), Michael Bussmann (Helmholtz-Zentrum Dresden-Rossendorf, CASUS), Sunita Chandrasekaran (University of Delaware), and Alexander Debus (Helmholtz-Zentrum Dresden-Rossendorf)

Domain: Physics

## Lessons Learned from a Performance Analysis and Optimization of a Multiscale Cellular Simulation

This work presents a comprehensive performance analysis and optimization of a multiscale agent-based cellular simulation. The optimizations applied are guided by detailed performance analysis and include memory management, load balance, and a locality-aware parallelization. The outcome of this paper is not only the speedup of 2.4x achieved by the optimized version with respect to the original PhysiCell code, but also the lessons learned and best practices for developing parallel HPC codes to obtain efficient and highly performant applications, especially in the computational biology field.

Author(s): Marc Clascà (Barcelona Supercomputing Center), Marta Garcia-Gasulla (Barcelona Supercomputing Center), Arnau Montagud (Barcelona Supercomputing Center), Jose Carbonell Caballero (Barcelona Supercomputing Center), and Alfonso Valencia (Barcelona Supercomputing Center, ICREA)

Domain: Life Sciences

## Longitudinal Effects on Plant Species Involved in Agriculture and Pandemic Emergence Undergoing Changes in Abiotic Stress

In this work, we identify changes in high-resolution zones across the globe linked by environmental similarity that have implications for agriculture, bioenergy, and zoonosis. We refine exhaustive vector comparison methods with improved similarity metrics as well as provide multiple methods of amalgamation across 744 months of climatic data. The results of the vector comparison are captured as networks which are analyzed using static and longitudinal comparison methods to reveal locations around the globe experiencing dramatic changes in abiotic stress. Specifically, we (i) incorporate updated similarity scores and provide a comparison between similarity metrics, (ii) implement a new feature for resource optimization, (iii) compare an agglomerative view to a longitudinal view, (iv) compare across 2-way and 3-way vector comparisons, (v) implement a new form of analysis, and (vi) demonstrate biological applications and discuss implications across a diverse set of species distributions by detecting changes that affect their habitats. Species of interest are related to agriculture (e.g., coffee, wine, chocolate), bioenergy (e.g., poplar, switchgrass, pennycress), as well as those living in zones of concern for zoonotic spillover that may lead to pandemics (e.g., eucalyptus, flying foxes).

Author(s): Mikaela Cashman (Lawrence Berkeley National Laboratory), Verónica G. Melesse Vergara (Oak Ridge National Laboratory), John Lagergren (Oak Ridge National Laboratory), Matthew Lane (University of Tennessee), Jean Merlet (University of Tennessee), Mikaela Atkinson (University of Tennessee), Jared Streich (Oak Ridge National Laboratory), Christopher Bradburne (Johns Hopkins Whiting School of Engineering), Raina Plowright (Cornell University), Wayne Joubert (Oak Ridge National Laboratory), and Daniel Jacobson (Oak Ridge National Laboratory)

Domain: Life Sciences

## A Massively Parallel Multi-Scale FE2 Framework for Multi-Trillion Degrees of Freedom Simulations

The advent of hybrid CPU and accelerator supercomputers opens the door to extremely large multi-scale simulations. An example of such a multi-scale technique, the FE2 approach, has been designed to simulate material deformations by getting a better estimation of the material properties, which, in effect, reduces the need to introduce physical modelling at macro-scale level, such as constitutive laws. Both macro- and micro-scales are solved using the Finite Element method, the micro-scale being resolved at the Gauss points of the macro-scale mesh. As the micro-scale simulations do not require any information from each other, and are thus run concurrently, the stated problem is embarrassingly parallel. The FE2 method therefore directly benefits from hybrid machines, the macro-scale being solved on CPU whereas the micro-scale is offloaded to accelerators. The case of a flat plate made of different materials is used to illustrate the potential of the method. In order to ensure good load balance on distributed memory machines, weighting based on the type of materials the plate is made of is applied by means of a Space Filling Curve technique. Simulations have been carried out for over 5 trillion degrees of freedom on up to 2,048 nodes (49,152 CPUs and 12,288 GPUs) of the US DOE Oak Ridge National Laboratory high-end machine, Summit, showing an excellent speed-up for the assembly part of the framework, where the micro-scale is computed on GPU using CUDA.

Author(s): Charles Moulinec (Science and Technology Facilities Council), Guillaume Houzeaux (Barcelona Supercomputing Center), Ricard Borrell (Barcelona Supercomputing Center), Adria Quintanas (Barcelona Supercomputing Center), Guillermo Oyarzun (Barcelona Supercomputing Center), Judicael Grasset (CNRS), Guido Giuntoli (Barcelona Supercomputing Center), and Mariano Vazquez (Barcelona Supercomputing Center)

Domain: Engineering

## Mixed-Precision Random Projection for RandNLA on Tensor Cores

Random projection can reduce the dimension of data while capturing its structure and is a fundamental tool for machine learning, signal processing, and information retrieval, which deal with a large amount of data today. RandNLA (Randomized Numerical Linear Algebra) leverages random projection to reduce the computational complexity of low-rank decomposition of tensors and solve least-square problems. While the computation of the random projection is a simple matrix multiplication, its asymptotic computational complexity is typically larger than other operations in a RandNLA algorithm. Therefore, various studies propose methods for reducing its computational complexity. We propose a fast mixed-precision random projection method on NVIDIA GPUs using Tensor Cores for single-precision tensors. We exploit the fact that the random matrix requires less precision, and develop a highly optimized matrix multiplication between FP32 and FP16 matrices — SHGEMM (Single and Half GEMM) — on Tensor Cores, where the random matrix is stored in FP16. Our method can compute Randomized SVD 1.28 times faster and Random projection high order SVD 1.75 times faster than baseline single-precision implementations while maintaining accuracy.
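The idea of storing the random matrix in lower precision can be sketched in a few lines of NumPy. This is an emulation under stated assumptions, not the authors' SHGEMM kernel: NumPy has no Tensor Core path, so the FP16 random matrix is cast back to FP32 before the product, and the function name and parameters are hypothetical.

```python
import numpy as np

def randomized_svd_fp16_sketch(A, rank, oversample=8, seed=0):
    """Randomized SVD of an FP32 matrix A using a random matrix stored in FP16."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # The random projection matrix tolerates low precision, so store it in FP16.
    omega_fp16 = rng.standard_normal((n, rank + oversample)).astype(np.float16)
    # Emulated FP32 x FP16 product: the FP16 values are widened to FP32 first.
    Y = A @ omega_fp16.astype(np.float32)
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for the sampled range
    B = Q.T @ A                         # small projected matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]

# Toy input (hypothetical): an FP32 matrix of exact rank 5.
rng = np.random.default_rng(1)
left = rng.standard_normal((200, 5)).astype(np.float32)
right = rng.standard_normal((5, 100)).astype(np.float32)
A = left @ right
U, s, Vt = randomized_svd_fp16_sketch(A, rank=5)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

Rounding the Gaussian entries to FP16 perturbs the random matrix slightly but it remains a valid random sketch, which is why the reconstruction error on this toy input stays at FP32 level.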

Author(s): Hiroyuki Ootomo (Tokyo Institute of Technology), and Rio Yokota (Tokyo Institute of Technology)

Domain: Computer Science, Machine Learning, and Applied Mathematics

## Model-Based Performance Analysis of the HyTeG Finite Element Framework

In this work, we present how code generation techniques significantly improve the performance of the computational kernels in the HyTeG software framework. This HPC framework combines the performance and memory advantages of matrix-free multigrid solvers with the flexibility of unstructured meshes. The PYSTENCILS code generation toolbox is used to replace the original abstract C++ kernels with highly optimized loop nests. The performance of one of those kernels (the matrix-vector multiplication) is thoroughly analyzed using the Execution-Cache-Memory (ECM) performance model. We validate these predictions by measurements on the SuperMUC-NG supercomputer. The experiments show that the performance mostly matches the predictions. In cases where the prediction does not match, we discuss the discrepancies. Additionally, we conduct a node-level scaling study which shows the expected behavior for a memory-bound compute kernel.

Author(s): Dominik Thönnes (Friedrich-Alexander-Universität Erlangen-Nürnberg), and Ulrich Rüde (Friedrich-Alexander-Universität Erlangen-Nürnberg, CERFACS)

Domain: Computer Science, Machine Learning, and Applied Mathematics

## Performance Study of Convolutional Neural Network Architectures for 3D Incompressible Flow Simulations

Recently, correctly handling spatial information from multiple scales has proven to be essential in Machine Learning (ML) applications on Computational Fluid Dynamics (CFD) problems. For these types of applications, Convolutional Neural Networks (CNN) that use Multiple Downsampled Branches (MDBs) to efficiently encode spatial information from different spatial scales have proven to be some of the most successful architectures. However, not many guidelines exist to build these architectures, particularly when applied to more challenging 3D configurations. Thus, this work focuses on studying the impact of the choice of the number of downsampled branches, accuracy and performance-wise in 3D incompressible fluid test cases, where a CNN is used to solve the Poisson equation. The influence of this parameter is assessed by performing multiple trainings of Unet architectures with varying MDBs on a cloud-computing environment. These trained networks are then tested on two 3D CFD problems: a plume and a Von Karman vortex street at various operating points, where the solution of the neural network is coupled to a nonlinear advection equation.

Author(s): Ekhi Ajuria Illarramendi (CERFACS, ISAE SUPAERO), Michael Bauerheim (ISAE SUPAERO), Neil Ashton (AWS), Corentin Lapeyre (CERFACS), and Bénédicte Cuenot (CERFACS)

Domain: Engineering

## Runtime Steering of Molecular Dynamics Simulations Through In Situ Analysis and Annotation of Collective Variables

This paper targets one of the most common simulations on petascale and, very likely, on exascale machines: molecular dynamics (MD) simulations studying the (classical) time evolution of a molecular system at atomic resolution. Specifically, this work addresses the data challenges of MD simulations at exascale through (1) the creation of a data analysis method based on a suite of advanced collective variables (CVs) selected for annotation of structural molecular properties and capturing rare conformational events at runtime, (2) the definition of an in situ framework to automatically identify the frames where the rare events occur during an MD simulation, and (3) the integration of both method and framework into two MD workflows for the study of early termination or termination and restart of a benchmark molecular system for protein folding, the Fs peptide system (Ace-A_5(AAARA)_3A-NME), using Summit. The approach achieves faster exploration of the conformational space compared to extensive ensemble simulations. Specifically, our in situ framework with early termination alone achieves 99.6% coverage of the reference conformational space for the Fs peptide with just ~60% of the MD steps otherwise used for a traditional execution of the MD simulation. Annotation-based restart allows us to cover 94.6% of the conformational space, just running 50% of the overall MD steps.

Author(s): Silvina Caino-Lores (University of Tennessee), Michel Cuendet (Swiss Institute of Bioinformatics, Cornell University), Jack Marquez (University of Tennessee), Ekaterina Kots (Cornell University), Trilce Estrada (University of New Mexico), Ewa Deelman (University of Southern California), Harel Weinstein (Cornell University), and Michela Taufer (University of Tennessee)

Domain: Life Sciences

## Scalable Multi-FPGA Design of a Discontinuous Galerkin Shallow-Water Model on Unstructured Meshes

FPGAs are fostering interest as energy-efficient accelerators for scientific simulations, including for methods operating on unstructured meshes. Considering the potential impact on high-performance computing, specific attention needs to be given to the scalability of such approaches. In this context, the networking capabilities of FPGA hardware and software stacks can play a crucial role to enable solutions that go beyond a traditional host-MPI and accelerator-offload model. In this work, we present the multi-FPGA scaling of a discontinuous Galerkin shallow water model using direct low-latency streaming communication between the FPGAs. To this end, the unstructured mesh defining the spatial domain of the simulation is partitioned, the inter-FPGA network is configured to match the topology of neighboring partitions, and halo communication is overlapped with the dataflow computation pipeline. With this approach, we demonstrate strong scaling on up to eight FPGAs with a parallel efficiency of >80% and execution times per time step of as low as 7.6 µs. At the same time, with weak scaling, the approach allows simulating larger meshes that would exceed the local memory limits of a single FPGA, now supporting meshes of more than 100,000 elements and reaching an aggregated performance of up to 6.5 TFLOPs. Finally, a hierarchical partitioning approach allows for better utilization of the FPGA compute resources in some designs and, by mitigating limitations posed by the communication topology, enables simulations with up to 32 partitions on 8 FPGAs.

Author(s): Jennifer Faj (Paderborn University), Tobias Kenter (Paderborn University), Sara Faghih-Naini (University of Bayreuth), Christian Plessl (Paderborn University), and Vadym Aizinger (University of Bayreuth)

Domain: Climate, Weather and Earth Sciences

## Scalable Riemann Solvers with the Discontinuous Galerkin Method for Hyperbolic Network Simulation

We develop a set of highly efficient and effective computational algorithms and simulation tools for fluid simulations on a network. The mathematical models are a set of hyperbolic conservation laws on edges of a network, as well as coupling conditions on junctions of a network. For example, the shallow water system, together with flux balance and continuity conditions at river intersections, models water flows on a river network. The computationally accurate and robust discontinuous Galerkin methods, coupled with explicit strong stability preserving Runge-Kutta methods, are implemented for simulations on network edges. Meanwhile, linear and nonlinear scalable Riemann solvers are being developed and implemented at network vertices. These network simulations result in tools that are added to the existing PETSc and DMNetwork software libraries for the scientific community in general. Simulations of a shallow water system on a Mississippi river network with over one billion network variables are performed on an extreme-scale computer using up to 8,192 processors with an optimal parallel efficiency. Further potential applications include traffic flow simulations on a highway network and blood flow simulations on an arterial network, among many others.
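The time-stepping ingredient named in the abstract, an explicit strong stability preserving (SSP) Runge-Kutta method, can be sketched independently of the network setting. The example below uses the widely known third-order Shu-Osher scheme; the abstract does not state which order the authors use, and the spatial operator here is a first-order upwind difference for linear advection on a periodic grid, swapped in as a simple stand-in for a discontinuous Galerkin discretization. All names and parameters are hypothetical.

```python
import numpy as np

def upwind_rhs(u, speed, dx):
    """First-order upwind operator for u_t + speed * u_x = 0 (speed > 0, periodic)."""
    return -speed * (u - np.roll(u, 1)) / dx

def ssp_rk3_step(u, dt, rhs):
    """One third-order SSP Runge-Kutta (Shu-Osher) step.

    Every stage is a convex combination of forward-Euler steps, so bounds that
    hold for forward Euler under its CFL limit carry over to the full step.
    """
    u1 = u + dt * rhs(u)
    u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1))
    return u / 3.0 + 2.0 / 3.0 * (u2 + dt * rhs(u2))

def advect(u0, speed=1.0, dx=0.01, cfl=0.5, steps=200):
    """Advance the initial profile u0 for a fixed number of steps."""
    dt = cfl * dx / speed
    u = u0.copy()
    for _ in range(steps):
        u = ssp_rk3_step(u, dt, lambda v: upwind_rhs(v, speed, dx))
    return u

# Toy input (hypothetical): a square pulse, which a non-SSP scheme could overshoot.
x = np.linspace(0.0, 1.0, 100, endpoint=False)
u0 = np.where((x > 0.25) & (x < 0.5), 1.0, 0.0)
u = advect(u0)
```

On this discontinuous profile the solution stays within its initial bounds and the total mass is conserved, which is the behavior SSP time integrators are chosen for in hyperbolic problems.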

Author(s): Aidan Hamilton (University of Delaware, Argonne National Laboratory), Jingmei Qiu (University of Delaware), and Hong Zhang (Illinois Institute of Technology)

Domain: Computer Science, Machine Learning, and Applied Mathematics
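
As a rough illustration of the explicit strong-stability-preserving Runge-Kutta time stepping mentioned in the abstract above, the sketch below implements the widely used third-order Shu-Osher scheme. It is not the authors' PETSc/DMNetwork code: the spatial operator is an assumed first-order upwind discretization of linear advection on a single periodic edge (the lowest-order analogue of a discontinuous Galerkin scheme), and all function names are hypothetical.

```python
import numpy as np


def ssp_rk3_step(u, dt, rhs):
    """One step of the third-order SSP Runge-Kutta (Shu-Osher) scheme.

    Each stage is a convex combination of forward-Euler steps, so bounds
    preserved by forward Euler are preserved by the full step as well.
    """
    u1 = u + dt * rhs(u)
    u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1))
    return u / 3.0 + (2.0 / 3.0) * (u2 + dt * rhs(u2))


def upwind_rhs(u, speed, dx):
    """Assumed spatial operator: periodic first-order upwind for u_t + a u_x = 0, a > 0."""
    return -speed * (u - np.roll(u, 1)) / dx


# Advect a square pulse once around a periodic edge of unit length.
n_cells, speed = 100, 1.0
dx = 1.0 / n_cells
dt = 0.5 * dx / speed  # CFL number 0.5
u0 = np.where((np.arange(n_cells) >= 20) & (np.arange(n_cells) < 40), 1.0, 0.0)
u = u0.copy()
for _ in range(200):
    u = ssp_rk3_step(u, dt, lambda v: upwind_rhs(v, speed, dx))
```

With this time step, the solution stays within the bounds of the initial data and the discrete mass `u.sum()` is conserved up to rounding, which is the behavior the SSP property is meant to carry over from forward Euler.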

## Scaling Resolution of Gigapixel Whole Slide Images Using Spatial Decomposition on Convolutional Neural Networks

Gigapixel images are prevalent in scientific domains ranging from remote sensing and satellite imagery to microscopy. However, training a deep learning model at the natural resolution of those images has been a challenge, both in overcoming resource limits (e.g., HBM memory constraints) and in scaling up to a large number of GPUs. In this paper, we trained Residual Neural Networks (ResNet) on 22,528 × 22,528-pixel images using a distributed spatial decomposition method on 2,304 GPUs on the Summit supercomputer. We applied our method to a Whole Slide Imaging (WSI) dataset from The Cancer Genome Atlas (TCGA) database. WSI images can be 100,000 × 100,000 pixels or even larger, and in this work we studied the effect of image resolution on a classification task, while achieving state-of-the-art AUC scores. Moreover, our approach does not need pixel-level labels, since we avoid patching the WSI images completely, while adding the capability of training on arbitrarily large images. This is achieved through a distributed spatial decomposition method that leverages the non-blocking fat-tree interconnect network of the Summit architecture, which enables direct GPU-to-GPU communication. Finally, detailed performance analysis results are shown, as well as a comparison with a data-parallel approach when possible.

Author(s): Aristeidis Tsaris (Oak Ridge National Laboratory), Josh Romero (NVIDIA Inc.), Thorsten Kurth (NVIDIA Inc.), Jacob Hinkle (Oak Ridge National Laboratory), Hong-Jun Yoon (Oak Ridge National Laboratory), Feiyi Wang (Oak Ridge National Laboratory), Sajal Dash (Oak Ridge National Laboratory), and Georgia Tourassi (Oak Ridge National Laboratory)

Domain: Computer Science, Machine Learning, and Applied Mathematics
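
To make the idea of spatial decomposition in the abstract above more concrete, the following single-process sketch splits an image into row bands, gives each band a halo of neighboring rows, applies the same convolution to every band, and stitches the results. It is an illustrative assumption, not the authors' implementation: in a real distributed setting each band would live on a different GPU and the halo rows would arrive through GPU-to-GPU communication rather than array slicing. All function names are hypothetical.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Plain 'valid' cross-correlation (the operation CNN layers call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out


def conv2d_same(image, kernel):
    """Reference result on the full image with zero padding (odd kernel sizes only)."""
    ph, pw = kernel.shape[0] // 2, kernel.shape[1] // 2
    return conv2d_valid(np.pad(image, ((ph, ph), (pw, pw))), kernel)


def conv2d_spatially_decomposed(image, kernel, n_parts):
    """Convolve row bands independently; each band carries a halo of kernel_height // 2 rows."""
    ph, pw = kernel.shape[0] // 2, kernel.shape[1] // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    bounds = np.linspace(0, image.shape[0], n_parts + 1).astype(int)
    pieces = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        # Rows owned by this band plus ph halo rows above and below
        # (stand-in for the halo exchange with neighboring devices).
        tile = padded[start:stop + 2 * ph, :]
        pieces.append(conv2d_valid(tile, kernel))
    return np.vstack(pieces)


rng = np.random.default_rng(0)
image = rng.standard_normal((16, 12))
kernel = rng.standard_normal((3, 3))
decomposed = conv2d_spatially_decomposed(image, kernel, n_parts=4)
reference = conv2d_same(image, kernel)
```

Because every band sees the same neighborhood it would see in the full image, `decomposed` matches `reference`, while no single worker ever needs to hold the whole image.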

## Streaming Generalized Canonical Polyadic Tensor Decompositions

In this paper, we develop a method, which we call OnlineGCP, for computing the Generalized Canonical Polyadic (GCP) tensor decomposition of streaming data. GCP differs from traditional canonical polyadic (CP) tensor decompositions in that it allows for arbitrary objective functions, which the CP model attempts to minimize. This approach can provide better fits and more interpretable models when the observed tensor data is strongly non-Gaussian. In the streaming case, tensor data is gradually observed over time and the algorithm must incrementally update a GCP factorization with limited access to prior data. In this work, we extend the GCP formalism to the streaming context by deriving a GCP optimization problem to be solved as new tensor data is observed, formulate a tunable history term to balance reconstruction of recently observed data with data observed in the past, develop a scalable solution strategy based on segregated solves using stochastic gradient descent methods, describe a software implementation that provides performance and portability to contemporary CPU and GPU architectures, and demonstrate the utility and performance of the approach and software on several synthetic and real tensor data sets.

Author(s): Eric Phipps (Sandia National Laboratories), Nicholas Johnson (Cerebras Systems Inc.), and Tamara Kolda (MathSci.ai)

Domain: Computer Science, Machine Learning, and Applied Mathematics

## StyleGAN as a Deconvolutional Operator for Large Eddy Simulations

We present a novel deconvolution operator for Large Eddy Simulation (LES) of turbulent flows based on the latest StyleGAN deep learning networks. We exploit the flexibility of this architecture in separating the different layers of the GAN generator, which can be seen as instantaneous fields of the LES. These can be advanced in time by integrating the corresponding filtered Navier-Stokes (NS) equations. The subgrid-scale (SGS) stress tensor is obtained from the reconstructed field, rather than from ad hoc turbulence models. We trained a StyleGAN-based network (MSG-StyleGAN) with 5000 images of a decaying 2D Homogeneous Isotropic Turbulence (2D-HIT) starting at *Re Pi* = 60 using a 256×256 grid. We then reconstructed a direct numerical simulation (DNS), point by point, using a 32×32 resolution via a search in the latent space of the GAN until the difference between the internal fields and the LES fields is within a given tolerance. Results show convergence towards the ground truth DNS solution as the tolerance approaches zero.

Author(s): Jony Castagna (Science and Technology Facilities Council) and Francesca Schiavello (Science and Technology Facilities Council)

Domain: Engineering

## SweepNet: A Lightweight CNN Architecture for the Classification of Adaptive Genomic Regions

The accurate identification of positive selection in genomes represents a challenge in the field of population genomics. Several recent approaches have cast this problem as an image classification task and employed Convolutional Neural Networks (CNNs). However, limited effort has been devoted to discovering a practical CNN architecture that can classify images visualizing raw genomic data in the presence of population bottlenecks, migration, and recombination hotspots, factors that typically confound the identification and localization of adaptive genomic regions. In this work, we present SweepNet, a new CNN architecture that resulted from a thorough hyper-parameter-based architecture exploration process. SweepNet has a higher training efficiency than existing CNNs and requires considerably fewer epochs to achieve high validation accuracy. Furthermore, it performs consistently better in the presence of confounding factors, generating models with higher validation accuracy and lower top-1 error rate for distinguishing between neutrality and a selective sweep. Unlike existing network architectures, the number of trainable parameters of SweepNet remains constant irrespective of the sample size and number of Single Nucleotide Polymorphisms, which reduces the risk of overfitting and leads to more efficient training for large datasets. Our SweepNet implementation is available for download at: https://github.com/Zhaohq96/SweepNet.

Author(s): Hanqing Zhao (University of Twente), Pavlos Pavlidis (Foundation for Research and Technology-Hellas), and Nikolaos Alachiotis (University of Twente)

Domain: Life Sciences

## Towards Lattice QCD+QED Simulations on GPUs

Improving the precision of particle physics predictions obtained from lattice simulations of quantum chromodynamics (QCD) requires extending the interactions considered thus far, leading to additional computational demands. The most commonly used publicly available program packages for efficient simulations with the Wilson discretization of the Dirac operator are highly scalable on CPU hardware. In order to run efficiently on existing and upcoming hybrid architectures, one needs to rethink the current strategy for data types used at different stages of the simulation, most notably in the frequent solves of the Dirac equation. We take the first steps towards porting to GPUs the three types of solvers used in simulations of clover-improved Wilson fermions: Conjugate Gradient, a Schwarz-preconditioned GCR solver, and a variant of the deflated solver. The analysis of the impact of reduced-precision data types on the convergence of each solver indicates several possibilities for overall performance improvement.

Author(s): Roman Gruber (ETH Zurich), Anton Kozhevnikov (ETH Zurich / CSCS), Marina Marinkovic (ETH Zurich), Thomas Schulthess (ETH Zurich / CSCS), and Raffaele Solca (ETH Zurich / CSCS)

Domain: Physics
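
The abstract above concerns the effect of reduced-precision data types on iterative solvers. One common pattern for exploiting low precision, shown here purely as an illustrative assumption and not as the authors' solver design, is iterative refinement: a Conjugate Gradient solve in single precision computes corrections, while residuals are accumulated in double precision. The test matrix is a generic dense symmetric positive definite matrix, not a Dirac operator, and all function names are hypothetical.

```python
import numpy as np


def conjugate_gradient(A, b, tol, max_iter):
    """Textbook CG for a symmetric positive definite A, run in the dtype of A and b."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    b_norm = np.sqrt(b @ b)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def mixed_precision_solve(A, b, tol=1e-10, max_outer=20):
    """Iterative refinement: float32 inner CG solves, float64 residual and solution."""
    A_low = A.astype(np.float32)
    x = np.zeros_like(b, dtype=np.float64)
    b_norm = np.linalg.norm(b)
    for _ in range(max_outer):
        residual = b - A @ x  # evaluated in double precision
        if np.linalg.norm(residual) <= tol * b_norm:
            break
        # Solve A e = residual only loosely, in single precision.
        correction = conjugate_gradient(A_low, residual.astype(np.float32), 1e-4, 200)
        x += correction.astype(np.float64)
    return x


rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M.T @ M + 50.0 * np.eye(50)  # well-conditioned SPD test matrix
b = rng.standard_normal(50)
x = mixed_precision_solve(A, b)
relative_residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
```

On this well-conditioned example, a few outer iterations drive the relative residual down to the double-precision target even though each inner solve is carried out entirely in single precision.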

## Understanding the Computing and Analysis Needs for Resiliency of Power Systems from Severe Weather Impacts

As the frequency and intensity of severe weather have increased, its effect on the electric grid has manifested in the form of significantly more and larger outages in the United States. This has become especially true for regions that were previously isolated from weather extremes. In this paper, we analyze the weather impacts on the electric power grid across a variety of weather conditions, draw correlations, and provide practical insights into the operational state of these systems. High-resolution computational modeling of specific meteorological variables, computational approaches to solving power system models under these conditions, and the types of resiliency needs are highlighted as goal-oriented computing approaches are being built to address grid resiliency needs.

Author(s): Jibonananda Sanyal (National Renewable Energy Laboratory), Melissa Dumas (Oak Ridge National Laboratory), Sangkeun Lee (Oak Ridge National Laboratory), Supriya Chinthavali (Oak Ridge National Laboratory), Jennifer King (National Renewable Energy Laboratory), and Srijib Mukherjee (Oak Ridge National Laboratory)

Domain: Engineering

## Universal Data Junction: A Transport Layer for Data Driven Workflows

A novel transport library for the efficient coupling of applications through their data dependencies is presented. The design is driven by the intent to require only minimal changes to existing scientific applications: they declare the data objects that are meaningful for other applications to read and write, and the library performs transparent transport, including automatic redistribution of parallel data structures, thus permitting seamless coupling of applications in workflows. The actual transport can be selected at run time and can exploit a variety of data exchange methods, including MPI, DataSpaces, Ceph RADOS, Cray DataWarp, and a POSIX file system. For the case of MPI transport, the library is used to implement the first stage of a co-working visualization pipeline for CP2K, and results show a significant advantage compared to a file-system-based approach.

Author(s): Utz-Uwe Haus (HPE), Tim Dykes (HPE), Aniello Esposito (HPE), Clement Foyer (Université de Reims Champagne-Ardenne), and Adrian Tate (Numerical Algorithms Group)

Domain: Chemistry and Materials