[table id=1 /]

[table id=2 /]

[table id=3 /]

## Overview Presentations

Overview: Massively Parallel Models
Speaker: Cliff Stein, Columbia University

Overview: MPC Lower Bounds
Speaker: Jara Uito, Aalto University

Overview: Computations on GPU and Dynamic Data Structures
Speaker: John Owens, UC Davis USA

Overview: Performance Portability in the Age of Extreme Heterogeneity
Speaker: John Shalf, Berkley National Laboratory

## Talks

Title: BSP Models: Past, Present and Future
Speaker: Bill McColl, Huawei, Switzerland

Abstract: An ever-expanding series of BSP models have been developed over the past 30 years. In the 1990s the emphasis was initially on BSP as a model for automatic-mode virtual shared memory, then later as a model for low-level direct-mode parallel programming based on superstep semantics and one-sided RDMA. In this form, the models were intended to be general-purpose and were aimed primarily at relatively expert programmers working on HPC. An important side effect of this early work on BSP was that it launched research on communication-avoiding parallel algorithms, which has since become a major area.

The next wave of BSP-style parallel programming models such as MapReduce and Spark addressed issues such as ease of programming and fault tolerance, with a willingness to sacrifice some performance at scale in order to achieve these important requirements. Others such as Pregel focused on a particular domain – graph computing, graph analytics, and graph databases, and introduced the vertex-centric style of BSP programming. Today, research on BSP models is expanding in many new directions, with new approaches to shared massive datasets, fault tolerance and tail tolerance, algebraic programming styles, DSLs, and to connections with new compiler technologies such as MLIR. In this talk I will give a personal view of the past, present and future of BSP models.

Abstract: We introduce the Adaptive Massively Parallel Computation (AMPC) model, which is a theoretical model that captures MapReduce-like computation augmented with a distributed hash table. Our model is inspired by the previous empirical studies of distributed graph algorithms [KLMRV, SOCC’14][BBDHKLM, NeurIPS’17] using MapReduce and a distributed hash table.

We show that in the AMPC model several fundamental graph problems such as graph connectivity, minimum spanning forest (MSF), and approximate maximum (weight) matching can be solved in just a constant number of rounds.

We also provide an empirical comparison of the algorithms in the MPC and AMPC models in a fault-tolerant distributed computation environment. We evaluate our algorithms on a set of large real-world graphs and show that our AMPC algorithms can achieve up to an order of magnitude improvements in both running time and round-complexity over optimized MPC baselines.

Title: Supercomputing, Cloud and Edge Security Models Convergence using APIs and Container Technologies
Abstract: Scientific applications and workflows today are exploiting a range of distributed resources including supercomputing, cloud, and edge technologies, such as scientific instruments. User access and security models across these resource domains vary considerably resulting in overhead for the development of scientific applications and workflow engines, often leading to productivity losses. Recently, adoption of zero trust security models have accelerated due to prevalent usage of off-premises cloud and edge technologies. The model relies on having multiple layers of trust and verification for applications, users, devices, and networks. HPC applications, networks, storage, and access controls have unique architectural features for delivering close-to-metal performance and to achieve scalability. In this talk, I introduce motivating use cases, discuss challenges in adopting zero trust security models for HPC, and present solutions that we have developed at the Swiss National Supercomputing Centre. These include a RESTful services gateway called FirecREST and a user-friendly, HPC container engine called Sarus. FirecREST serves as a secure, web-accessing interface to HPC resources providing a coupling layer between the identity and access management of cloud and the essential POSIX interfaces used in HPC. Sarus fulfills unique HPC applications’ portability requirements offering native hardware performance while improving isolation and security in multi-tenant environments.

Title: How to hold CPU vendors’ feet to the Software Spiral fire?
Speaker: Uzi Vishkin, University of Maryland, USA
Abstract: Andy Grove (Intel’s business leader till 2004) termed “software spiral (SWS)” the exceptionally resilient business model behind general-purpose CPUs. Application software is the cornerstone of SWS: Code written once could yet benefit from performance scaling of later CPU generations. But, after nearly two decades of multicore CPUs domination, the jury is out: extension of SWS from serial single core CPUs to parallel general-purpose multicore CPUs has failed, missing goals stated by CPU vendors themselves.
The serial RAM algorithmic model has long been disfavored by architects due to abstracting away architecture features. However, as long as it was part of the specs for single core CPUs, they had to bite the bullet and continue supporting the RAM model. The transition to multicores reversed the flow from algorithm-model-driven specs to architecture. Vendors started forcing algorithmicists and application developers to bend over backwards to meet manycore hardware. The result: too few program today’s manycores for parallelism. Ever changing hardware designs further exacerbated the problem. Not only that convergence to a manycore software spiral was never achieved, it disappeared from vendors’ narratives.
It is finally time to revert back to what worked prior to derailment of the software spiral. General purpose parallel algorithms representing millions of application programmers must drive specs for manycore CPUs. Extensive hardware and software prototyping that my team did at UMD has demonstrated technical feasibility. There is also new hope from the business side. Competition among CPU vendors is getting fierce for the first time since the 1990s. However, the main challenge is to establish an appealing application-based proposition. Introspection on the most impactful recent applications will lead to constructive lessons towards such a proposition.

Title: Correlation Clustering in Constant Many Parallel Rounds
Speaker: Silvio Lattanzi, Google Research, Barcelona, Spain
Abstract: Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental scalability evaluation of our techniques.

Title: Connections between LOCAL and Low-Space Massively Parallel Computation
Speaker: Peter Davies, University of Surrey
Abstract: In recent years, the fields of distributed and parallel graph algorithms, while always linked, have grown increasingly close. In particular, the emergent Massively Parallel Computation (MPC) model has been shown to have a strong connection to the classic LOCAL model of distributed computing. Under some restrictions, LOCAL algorithms can be run at an exponential speedup in MPC, via a process known as graph exponentiation. Furthermore, recent work has suggested that for many problems, this might be the best that one can hope to do in low-space MPC.

This talk will begin with an overview of what is known about this connection, both from an upper- and lower-bounds perspective. We will then discuss some recent results (joint work with Artur Czumaj and Merav Parter) showing that the power of low-space MPC is not limited to graph exponentiation, and demonstrate examples of algorithms which are super-exponentially faster than their LOCAL counterparts. Finally, we will consider the implications for designing algorithms or proving lower bounds in MPC.

Title: Work on Foundations of Processing-in-Memory, which includes model of computation, new algorithms, and performance evaluations on a real PIM system
Speaker: Phil Gibbons, CMU, USA
Abstract: TBD

Title: The Migratable Objects Programming Model for a broad spectrum of large computations
Speaker: Sanjay Kale, UIUC, USA
Abstract: The Migratable objects Programming Model is easy to describe: The computation in such a model consists of a set of objects that interact with each other primarily via asynchronous method invocations, Other communication mechanisms such as channels, future-based communication etc. are built on top of it The objects are organized in multiple indexed collections. Importantly, the placement of objects to processors is under the control of an adaptive runtime system, which may choose to migrate objects across processors and nodes as it sees fit. This can be used to effect dynamic load balancing, tolerating faults, shrinking and expanding the set of processors use as well as energy/power/thermal optimizations.

Implementation of this model in in Charm++ has been demonstrated in multiple highly scalable science applications, including those in biomolecular simulation (NAMD), computational astronomy (ChaNGa), etc. More recently, Charm4Py has deployed the same model for Python applications. But it is also a good for broader distributed computing applications. It has excellent potential for expressing graph algorithms, machine learning/training computations, and distributed agent models. It derives its strength from separation of concerns between the physical and virtual notion of resources, with fine grained low-overhead scheduling and support for compositionally. The talk will describe the model and explain this potential.

Title: Almost 3-Approximate Correlation Clustering in Constant Rounds
Abstract: We study parallel algorithms for correlation clustering. In this problem, each pair among n objects is labeled as either “similar” or “dissimilar”. The goal is to partition the objects into arbitrarily many clusters while minimizing the number of disagreements with the labels.

I will describe an algorithm that obtains an (almost) 3-approximation in constant rounds (of models such as massively parallel computation, local, and semi-streaming). This is a culminating point for the rich literature on parallel correlation clustering as the approximation (almost) matches a natural barrier of 3 for combinatorial algorithms and the round-complexity is constant.

Title: MapReduce Algorithms in Fewer Rounds
Abstract: One of the most important parameters in the MapReduce model of computation is the number of rounds taken by an algorithm. In this talk we consider a few combinatorial optimization problems and show round-efficient MapReduce algorithms. These algorithms are essentially designed from their sequential, i.e., round-inefficient, counterparts.

Title: HPC/BD/AI supported by Big Memory Supercomputer
Speaker: Taisuke Boku, Tsukuba University, Japan
Abstract: One of the biggest issues for high performance data science
and AI is the performance and capacity of storage, both for main
memory and I/O storage. For the main memory, traditional DRAM solution
limits the capactiy even though the bandwidth and latency are almost
sufficient. About I/O storage, SSD technology provides high
performance for local storage, however we need a distributed one for
large scale parallel processing.

One of the novel technologies to apply for these issues is the
persistent memory (PMEM). PMEM is attachable to CPU directly through
ordinary memory bus while it provides much larger capacity than DRAM
although the latency is sightly lower. There are several usage of this
new device such as large capactiy directly accessable memory, coupling
with DRAM as a sort of cache to compensate the speed of PMEM, or a
local I/O storage device replacing SSD.

In the Center for Computational Sciences (CCS) at University of
Tsukuba, we are planning to introduce a new supercomputer named
“Cygnus-BD” toward a new method of big data science and AI supported
by PMEM technology. We call this machine as Big Memory Supercomputer,
where BD of the system name stands for Big Data. We combine every
technology supported by PMEM both for large capacity local memory
coupled with DRAM and also implementing high performance distributed
shared storage as an ad hoc file system. In this talk, I will
introduce how PMEM can be applied for such applications and systems
as well as introduction of our Cygnus-BD system plan.

Title: Massively Parallel Algorithms for Locally Checkable Graph Problems
Speaker: Jara Uitto, Aalto University, Finland
Abstract: We consider a natural family of Locally Checkable Labeling (LCL) problems on graphs.
A solution to an LCL problem is an assignment of labels to each node (or edge), from a finite set of labels.
The correctness of a solution can be verified by observing the labels in the neighborhood of each node.
The class of LCL problems captures many of the central graph problems studied in distributed and parallel computing, including graph coloring, maximal matching, and maximal independent set.

In the last decade, there has been tremendous progress in the study of LCL problems in classic message-passing models of distributed computing such as the LOCAL model.
The round complexity of each LCL problem falls into one of the following categories: $\Theta(1), \Theta(\log^* n), \Theta(\log \log n), \Theta(\log n)$, or $\Theta(n^k)$, for some constant $k \leq 1$.
In particular, the landscape of complexities exhibits gaps, that is, there are no LCL problems with complexity, for example, $\omega(1)$ and $o(\log^* n)$.
Inspired by these results, we study LCL problems in the most restricted low-space variant of the Massively Parallel Computation (MPC) model.
We show that any LCL problem with LOCAL complexity $f(n)$ can be solved in $O(\log f(n))$ MPC rounds.
These results are based on novel rooting techniques and efficient computation of the classic rake-and-compress decomposition by Miller and Reif.
All of our algorithms work with optimal total memory, i.e., the sum of memories per machine is linear in the input size.

This talk is based on joint work with Sebastian Brandt and Rustam Latypov as well as on work-in-progress.

Title: Distributed matrix multiplication
Speaker: Keren Censor-Hillel, Technion, Israel
Abstract: In this talk I will describe the state of the art algorithms for distributed matrix multiplication, some of their applications, and the many open questions.

Title: Massively parallel graph connectivity
Speaker: Krzysztof Nowicki, University of Copenhagen, Denmark
Abstract: This talk is about MPC algorithms for problems related to graph connectivity.
The MPC model is a model used for studying complexity of parallel algorithms for big data sets (e.g. algorithms for frameworks like MapReduce, Hadoop, or Spark). In this model, computation is performed by a set of machines in synchronous rounds, each round consists of a phase of local computation and a phase of communication.
In this talk I will present a simple MPC algorithm for spanning trees and edge connectivity that uses machines with a fairly large local memory and briefly discuss the state of the art for MPC with low memory machines.
Furthermore, I will address the 2-cycle conjecture and its implications toward the hardness of problems in MPC with low memory machines, and possible ways of bypassing this hardness in some practical scenarios.

Title: Massively Parallel Algorithms for Maximal Matching and Edit Distance
Speaker: MohammadTaghi Hajiaghayi, University of Maryland, USA
Abstract: In this talk we will discuss the recent algorithmic progress made on the Massively Parallel Computations (MPC) model. The MPC model provides a clean theoretical abstraction of modern parallel computation frameworks such as MapReduce, Hadoop, Spark, etc., which have been extremely successful in processing large-scale data-sets.

Our main focus in the talk will be on the maximal matching problem. We give an outline of the analysis of an extremely simple algorithm, improving exponentially over previous maximal matching results. The analysis is based on a novel method of proving concentration bounds for algorithms satisfying a certain “locality” property, which we believe may find applications beyond the MPC model. We will also survey some other recent results in the area. Particularly, we will overview an algorithm for edit distance and longest common subsequence with almost tight bounds.

Title: Parallel Batch-Dynamic k-Core Decomposition
Speaker: Julian Shun, MIT, USA
Abstract: In this talk, we present the first parallel batch-dynamic algorithm for approximate k-core decomposition that is efficient in both theory and practice. Our algorithm is based on our novel parallel level data structure, inspired by the sequential level data structures of Bhattacharya et al. and Henzinger et al. Given a graph with n vertices and a batch of B updates, our algorithm maintains a (2 + epsilon)-approximation of the coreness values of all vertices (for any constant epsilon > 0) in O(B log^2(n)) amortized work and O(log^2(n) loglog(n)) span (parallel time) with high probability. We implement and experimentally evaluate our algorithm, and demonstrate significant speedups over state-of-the-art serial and parallel implementations for dynamic k-core decomposition. Using our parallel level data structure, we also obtain new parallel batch-dynamic algorithms for low out-degree orientation, maximal matching, clique counting, and graph coloring.

Title: Algorithmic applications of compact graph representations
Speaker: Alex Andoni, Columbia University, USA
Abstract: A now-classic approach to improving algorithm complexity is via
compact graph representations: smaller graphs that preserve certain
properties. Perhaps the most famous example is a graph spanner: a
sparse(r) graph that approximately preserves the shortest-path
distances between the nodes of the graph. This object (together its
extension and a corresponding fast data structure of distance
oracles) has been studied for decades.

In this talk we discuss a couple types of such spanners that have
applications to parallel and near-linear time algorithms. In
particular, we introduce lop-hop emulators — a spanner-like graph
where all shortest paths have few hops — and show its applications
to parallel shortest paths (and other) algorithms. We also highlight
spanner for geometric datasets, with applications to the Earth-Mover
Distance problem (aka, optimal transport, Wasserstein distance).

Based on joint work with Jaroslaw Blasiok, Arnold Filtser, Cliff
Stein, Peilin Zhong.

Title: New Techniques in Massively Parallel Algorithms Through the Lens of Graph Coloring
Speaker: Sepehr Assadi, Rutgers University, USA
Abstract: The Massively Parallel Computation (MPC) model is a model of computation for processing massive datasets along the lines of practical frameworks such as MapReduce, Hadoop, or Spark. Motivated by this, there has been a surge of interest in recent years on designing MPC algorithms for various graph problems on massive graphs.

In this talk, I will give a brief overview of some general techniques for designing MPC algorithms for graph problems, while focusing on the graph coloring problem as a canonical example of a problem that captures many of these techniques.

Title: Learning-Based Algorithms
Speaker: Piotr Indyk, MIT, USA
Abstract: Classical algorithms typically provide “one size fits all” performance, and do not leverage properties or patterns in their inputs. A recent line of work aims to address this issue by developing algorithms that use machine learning predictions to improve their performance. In this talk I will present examples of results this type, in the context of streaming and sampling algorithms. In particular, I will show how to use machine learning predictions to improve the performance of algorithms for counting triangles in a graph.

Title: Challenges in Using Hierarchical Programming Approaches for Colossal Scale Systems
Speaker: Bill Gropp, UIUC, USA
Abstract: With the plateau in clock speed for individual computing elements, performance increases have relied on increases in parallelism and in processor specialization. Systems typically contain large numbers of nodes, each node with multiple chips, often combining different processing approaches, and each chip implementing several levels of parallelism. Further complexity comes from different memory and data systems. One approach to program such systems is to combine programming models, targeting the different elements. However, these systems are not designed to be part of a combined programming model, which impacts both their efficiency, their usability by programmers, and even the choice of algorithms for the application. This talk will outline some of the challenges and suggest research directions.