
Notes: Pure - Improving Message Passing to Better Utilize Intra-Node Shared Memory


# Notes: Pure: Evolving Message Passing to Better Leverage Shared Memory Within Nodes

# Citation

James Psota and Armando Solar-Lezama. 2024. Pure: Evolving Message Passing To Better Leverage Shared Memory Within Nodes. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP ‘24). Association for Computing Machinery, New York, NY, USA, 133–146. https://doi.org/10.1145/3627535.3638503

# Keywords

  • Parallel programming model
  • Distributed runtime system
  • Task-based parallel model
  • Concurrent data structures
  • Lock-free data structure

# Summary

Pure is a new programming model and runtime system designed to fully exploit shared memory within nodes in Message Passing Interface (MPI)-based environments, using optional tasks to put otherwise idle cores to work. Pure leverages shared memory in two ways: (1) allowing ranks to steal work from each other while waiting for messages to arrive; (2) using efficient lock-free data structures to implement high-performance message passing and collective operations among the ranks within a node. The authors evaluate Pure's key message passing and collective features with microbenchmarks and show that, in the CoMD molecular dynamics and miniAMR adaptive mesh refinement applications, Pure achieves up to 2.1x application speedup when scaling to 4096 ranks.

# 1. Introduction

In recent decades, the field of high-performance computing has transitioned from large vector computers to clusters composed of single processors interconnected through networks. MPI has become the de facto standard for parallel programming on distributed memory systems. With advancements in hardware, the emergence of multi-core clusters has allowed cores within nodes to share memory and communicate over networks, prompting the community to continually seek new paradigms to more efficiently utilize modern cluster resources. Currently, there are two main strategies: one is to maintain a unified MPI programming approach by improving the MPI runtime system to better utilize shared memory; the other is to adopt hybrid programming models like MPI+X, using shared memory parallelism within nodes while continuing to use MPI between nodes. However, these approaches may either be limited by the MPI standard’s specifications on interface behavior, preventing performance maximization, or present challenges to programmers in managing two programming models.

The community has tried many other approaches, including the PGAS model, which provides a shared-memory abstraction across a cluster, and implicitly parallel programming languages such as Legion, Chapel, and X10, which offer high-level abstractions and attempt to coordinate applications automatically and efficiently. Despite this progress, many modern HPC applications still rely on MPI. MPC and AMPI likewise try to improve performance by implementing MPI ranks as threads in order to exploit intra-node shared memory.

However, MPI-everywhere approaches often outperform hybrid programming approaches. Even so, the constraints of the MPI interface and its limited use of intra-node shared memory keep MPI from reaching its full performance potential. The Pure system proposed in this paper therefore builds on the MPI-everywhere approach but breaks some of MPI's traditional assumptions in order to exploit shared memory more effectively, while avoiding major restructuring of existing programs. Pure adopts a programming model similar to MPI, so it can draw on the HPC community's existing MPI knowledge and application base.

The design inspiration for Pure comes from MPI: its core programming model is based on message passing, optionally combined with task parallelism. Unlike MPI, Pure abandons process-based ranks and the requirement to support legacy languages, implementing ranks as threads rather than processes. This shift lets Pure use lightweight lock-free synchronization to coordinate threads within the same node. Building on this threaded-rank architecture, Pure constructs efficient intra-node collective operations and optimizes them with lock-free algorithms. Additionally, Pure lets applications express portions of their parallel code as standard C++ lambda expressions, which can be executed concurrently by the thread that owns the rank as well as by other temporarily idle ranks, all scheduled automatically by the Pure runtime system.

The optimization strategies proposed in the paper cover the following points:

  • A lock-free messaging approach for both small messages and large messages.
  • Lock-free data structures for efficient implementations of collective communication algorithms.
  • A lock-free task scheduler that lets idle threads efficiently “steal” work from other threads.

The authors use only the standard C++ library to ensure Pure's broad compatibility and demonstrate significant performance improvements over highly optimized MPI baselines. They also show that the Pure programming model is semantically very close to MPI, so migrating existing applications to Pure is straightforward, as further demonstrated by the source-to-source translation tool mpi2pure. Overall, the main contributions of the paper can be summarized as follows:

  1. A new programming model and runtime system that effectively combine message passing and task parallelism, implemented using features of standard C++.
  2. A demonstration of how modern C++ enables more flexible parallel runtime system interfaces.
  3. A carefully designed lock-free, multithreaded, distributed runtime system that shows significant intra-node speedups over MPI.
  4. Evidence that, with minimal source changes to existing MPI applications, significant performance improvements over state-of-the-art MPI implementations can be achieved in microbenchmarks and three real-world applications.

# 2. Pure Usage Example

This section illustrates the use of Pure through a simple 1-D Stencil algorithm example. Although this example is simple, it clearly demonstrates the core concepts of Pure and its similarities with MPI, laying the foundation for developers to write more complex applications.

In the MPI version, rand_stencil_mpi, the main computational work is performed by the function random_work. The function loops iters times, applying random_work to each element of the array a and storing the results in temp. Notably, the execution time of random_work is variable and unknown, which introduces load imbalance. random_work does not modify a; instead, a is updated by averaging adjacent elements of temp. Finally, the program uses MPI_Send and MPI_Recv to exchange the boundary elements of temp with neighboring ranks so that the first and last elements of a can be computed. Because random_work takes a varying amount of time, some ranks finish early and then block in MPI_Recv waiting for slower senders.



Example 1: 1-D Stencil with Random Work, MPI Version

void rand_stencil_mpi(double* const a, size_t arr_sz, size_t iters, int my_rank,
                      int n_ranks) {
  double temp[arr_sz];
  for (auto it = 0; it < iters; ++it) {
    for (auto i = 0; i < arr_sz; ++i) {
      temp[i] = random_work(a[i]);
    }
    for (auto i = 1; i < arr_sz - 1; ++i) {
      a[i] = (temp[i - 1] + temp[i] + temp[i + 1]) / 3.0;
    }
    if (my_rank > 0) {
      MPI_Send(&temp[0], 1, MPI_DOUBLE, my_rank - 1, 0, MPI_COMM_WORLD);
      double neighbor_hi_val;
      MPI_Recv(&neighbor_hi_val, 1, MPI_DOUBLE, my_rank - 1, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      a[0] = (neighbor_hi_val + temp[0] + temp[1]) / 3.0;
    }  // ends if not first rank
    if (my_rank < n_ranks - 1) {
      MPI_Send(&temp[arr_sz - 1], 1, MPI_DOUBLE, my_rank + 1, 0,
               MPI_COMM_WORLD);
      double neighbor_lo_val;
      MPI_Recv(&neighbor_lo_val, 1, MPI_DOUBLE, my_rank + 1, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      a[arr_sz - 1] =
          (temp[arr_sz - 2] + temp[arr_sz - 1] + neighbor_lo_val) / 3.0;
    }  // ends if not last rank
  }    // ends for all iterations
}

Example 2 shows the Pure version of the same functionality. There are a few key differences. First, the messaging interface differs: the code calls the Pure message passing functions pure_send_msg and pure_recv_msg instead of MPI calls, though the parameters are essentially the same as in the corresponding MPI functions. Pure's message passing semantics are similar to MPI's: the sender's buffer is copied to the receiver's buffer. The main implementation difference is that Pure uses a lightweight intra-node message passing mechanism, giving intra-node messages lower latency than MPI.


Example 2: Pure Version

void rand_stencil_pure(double* a, const int arr_sz, const int n_iter,
                       const int my_rank, const int n_ranks) {
  double temp[arr_sz];
  PureTask rand_work_task = [a, temp, arr_sz, my_rank](
                                chunk_id_t start_chunk, chunk_id_t end_chunk,
                                std::optional<void*> cont_params) {
    auto [min_idx, max_idx] =
        pure_aligned_idx_range<double>(arr_sz, start_chunk, end_chunk);
    for (auto i = min_idx; i < max_idx; i++) {
      temp[i] = random_work(a[i]);
    }
  };  // ends defining the Pure Task for rand_work_task
  for (auto it = 0; it < n_iter; it++) {
    rand_work_task.execute();  // execute all chunks of rand_work_task
    for (auto i = 1; i < arr_sz - 1; ++i) {
      a[i] = (temp[i - 1] + temp[i] + temp[i + 1]) / 3.0;
    }
    if (my_rank > 0) {
      pure_send_msg(&temp[0], 1, MPI_DOUBLE, my_rank - 1, 0, PURE_COMM_WORLD);
      double neighbor_hi_val;
      pure_recv_msg(&neighbor_hi_val, 1, MPI_DOUBLE, my_rank - 1, 0,
                    PURE_COMM_WORLD);
      a[0] = (neighbor_hi_val + temp[0] + temp[1]) / 3.0;
    }  // ends if not first rank
    if (my_rank < n_ranks - 1) {
      pure_send_msg(&temp[arr_sz - 1], 1, MPI_DOUBLE, my_rank + 1, 0,
                    PURE_COMM_WORLD);
      double neighbor_lo_val;
      pure_recv_msg(&neighbor_lo_val, 1, MPI_DOUBLE, my_rank + 1, 0,
                    PURE_COMM_WORLD);
      a[arr_sz - 1] =
          (temp[arr_sz - 2] + temp[arr_sz - 1] + neighbor_lo_val) / 3.0;
    }  // ends if not last rank
  }    // ends for all iterations
}

The more important difference is the addition of Pure Task in Pure, which uses a lambda expression defined with a set of specific parameters. It leverages the capture parameter feature of lambda, allowing variables external to the lambda body to be captured by value or reference and used when the lambda is executed. Pure Task can be viewed as a snippet of application code executed by the Pure Runtime system and can be executed concurrently through multithreading. Therefore, Pure tasks should be structured in a data-parallel-like form. Additionally, Pure Task requires the programmer to ensure thread safety.

In the above Pure implementation, programmers can use chunk ranges to describe concurrency. These subranges or chunks are passed to the Pure Task through the start_chunk and end_chunk parameters, and they are provided by the Pure Runtime system. The Pure Runtime system is responsible for ensuring that all work is completed smoothly. Since multiple different threads may be involved, the Pure Runtime system achieves this by tracking which chunks have been allocated and completed.

Secondly, programmers need to map the start_chunk and end_chunk parameters provided by the Pure Runtime system to specific content related to application computation. Here, the code uses the pure_aligned_idx_range helper function to convert them into loop subranges. This helper function takes cache lines into account, which helps avoid false sharing issues.
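
For intuition, a helper of this kind might compute something like the sketch below. This is a hypothetical approximation: the extra n_chunks and cache_line_bytes parameters are assumptions for illustration, not part of Pure's actual interface.

#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical sketch of what a helper like pure_aligned_idx_range might do:
// map a chunk range onto an index range whose boundaries fall on cache-line
// multiples of the element type, so chunks running on different threads never
// share a cache line (avoiding false sharing).
template <typename T>
std::pair<std::size_t, std::size_t>
sketch_aligned_idx_range(std::size_t arr_sz, unsigned start_chunk, unsigned end_chunk,
                         unsigned n_chunks, std::size_t cache_line_bytes = 64) {
  const std::size_t elems_per_line  = cache_line_bytes / sizeof(T);
  const std::size_t n_lines         = (arr_sz + elems_per_line - 1) / elems_per_line;
  const std::size_t lines_per_chunk = (n_lines + n_chunks - 1) / n_chunks;
  const std::size_t min_idx = std::min(arr_sz, start_chunk * lines_per_chunk * elems_per_line);
  const std::size_t max_idx = std::min(arr_sz, (end_chunk + 1u) * lines_per_chunk * elems_per_line);
  return {min_idx, max_idx};
}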

Due to random_work potentially causing uneven load distribution, some ranks may be idle while waiting for messages. The Pure task scheduler will automatically utilize these idle ranks to execute other pending Pure task blocks within the same node. Take the example of three ranks within the same node in the diagram below: rank 0 is executing a Pure Task divided into 6 chunks, while rank 1 and rank 2 are blocked due to receiving messages.


Example Pure code timeline diagram

From the diagram, the following execution flow can be clearly seen:

  • rank 0 begins processing the first chunk (chunk 0).
  • At the same time, rank 1 steals and executes the second chunk (chunk 1) in parallel.
  • The task scheduler then assigns the third chunk (chunk 2) to rank 0 and the fourth chunk (chunk 3) to rank 1.
  • rank 2 attempts to steal work and executes the fifth chunk (chunk 4). Because random_work's running time is random, chunk 2 and chunk 4 happen to be the expensive chunks.
  • rank 0 processes chunk 5, a cheaper chunk, and finishes it before rank 2 completes chunk 4.
  • The task scheduler ensures that rank 0's execute call does not return before all chunks are complete; in fact, rank 0 must wait until chunk 4 finishes before it can continue.
  • During the process of waiting for messages at rank 1 and rank 2, they will attempt to steal more chunks from any other available rank.
  • Thanks to the variable capture feature of lambda expressions, context information can be efficiently shared between different ranks.

Experimental results show that on a single node with 32 ranks, the Pure version is about 10% faster than the MPI version, thanks to faster message passing and parallel execution of the Pure Task. In scenarios with heavy load imbalance, Pure's gains exceed 200%. Although the size of these improvements depends on the degree of load imbalance, Pure still delivers significant gains in practical applications, because the Pure runtime automatically detects and exploits underutilized compute resources.

# 3. Programming Model

The core of Pure’s programming model is “message passing combined with optional task parallelism.” Semantically, Pure’s message passing and collective communication operations are equivalent to MPI, with differences mainly in some syntactical details.

Although Pure uses threads within nodes, its rank namespace remains non-hierarchical across the entire cluster. During the execution cycle of a Pure program, the number of ranks remains unchanged.

Pure applications are written in C++ and run in the SPMD (Single Program Multiple Data) style, with multithreading inside each process. Within a node, all ranks are implemented as kernel threads.

It is important to note that Pure applications must not use writable global variables. Developers should either remove globals or declare them with the thread_local keyword so that each rank gets its own copy, preserving thread safety.
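
For example, a file-scope counter that was safe under MPI's process-per-rank model must become thread-local under Pure (the variable name below is illustrative, not taken from the paper):

// Under MPI, each rank is a process, so a file-scope global is private to the rank.
// Under Pure, ranks are threads, so the same global would be shared by every rank
// on the node. Declaring it thread_local restores one copy per rank.

// int timestep_counter = 0;               // unsafe with Pure's threaded ranks
thread_local int timestep_counter = 0;     // safe: one instance per thread (rank)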

For applications with load imbalance issues, developers can use Pure Task in parts of the program that meet the following specific conditions:

  1. Compute-intensive hotspot areas.
  2. Tasks that can be executed concurrently.

# Message passing and collective communication operations

In Pure, the pure_send_msg and pure_recv_msg functions correspond functionally to MPI’s MPI_Send and MPI_Recv, and Pure also provides corresponding non-blocking versions.

The Pure runtime system guarantees that all messages are delivered, and that they arrive in the order in which they were sent. Pure also implements a set of collective communication operations, including:

  • Reduce
  • All-Reduce
  • Barrier
  • Broadcast

In addition, Pure supports communicator subgroups: developers can split a communicator into smaller communicators using the pure_comm_split function, analogous to MPI_Comm_split.
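
As a rough usage sketch (hypothetical: the parameter list is modeled on MPI_Comm_split and the PureComm type name is assumed, since the paper does not show the exact signature):

// Hypothetical sketch modeled on MPI_Comm_split; not the documented Pure API.
// Ranks with the same "color" end up in the same sub-communicator; "key" orders
// ranks within it.
PureComm* row_comm = nullptr;
const int color = my_rank / ranks_per_row;
const int key   = my_rank;
pure_comm_split(PURE_COMM_WORLD, color, key, &row_comm);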

To use Pure, applications must be written in modern C++ and compiled with -std=c++17 or later (the runtime itself relies on C++17 features such as std::optional). Pure provides a Make-based build system that configures the appropriate compiler options, links against the Pure runtime library (libpure), and defines a set of targets for debugging and performance analysis.

# Pure Task

Pure Task allows developers to define the computational parts of an application and break them down into chunks that can be executed in parallel. These chunks can be automatically executed concurrently by the Pure Runtime system.

However, Pure Task is not necessary and is only recommended when a task can be divided into multiple smaller chunks and doing so helps alleviate load imbalance issues.

Pure Task is implemented through C++ Lambda expressions and synchronously executed when the execute method is called on the rank that owns the task. Each rank can only execute one Pure Task at a time. The variable capture feature of Lambda expressions allows different ranks to efficiently share context information when executing different chunks. Typically, a Pure Task is defined once during the application’s runtime and then executed multiple times at each timestep or other iterations.

When defining a Pure Task, you specify the number of chunks and any additional application-level parameters. Chunks of a task should not depend on one another; and because a task runs to completion within the execute call, it cannot conflict with code outside the task.

Pure Task contains an execute method, which accepts a parameter per_exe_args of type optional<void*>, used to pass additional arguments each time the task is executed. This is very useful when the input values of the task body change during consecutive executions. For example, a developer can pass a pointer to a local structure to the execute method.
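
For example, a caller might pass per-timestep inputs like this (a hypothetical sketch; the struct and the exact way execute forwards the pointer are illustrative, not taken from the paper):

// Hypothetical: forward per-timestep inputs to the task body via per_exe_args.
struct StepParams {
  double dt;
  int    step;
};

StepParams params{0.01, it};
my_task.execute(&params);
// Inside the task body:
//   if (per_exe_args) {
//     const auto* p = static_cast<const StepParams*>(*per_exe_args);
//     // ... use p->dt and p->step ...
//   }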

The first two parameters of a Pure Task, start_chunk and end_chunk, are unsigned integers used to specify the range of chunks to execute. These chunks are allocated by the Pure Runtime system, ensuring that each chunk is executed only once, even if they may be executed concurrently.

Pure Task uses chunk ranges to provide flexibility to the scheduler, allowing multiple chunks to be allocated at once. The number of chunks is determined by the Pure task scheduler, but will not exceed the PURE_MAX_TASK_CHUNKS predefined in the Makefile.

Currently, the Pure Task interface requires manually mapping chunk numbers to array indices, which can be cumbersome when dealing with multidimensional arrays. Therefore, the future goal is to extend the interface to provide a more concise and higher-level interface similar to TBB’s parallel_for.

Finally, developers need to ensure that the internal implementation of Pure Task is thread-safe to avoid mutual contention between chunks of the same task being executed concurrently. For example, in the CoMD molecular dynamics benchmark, the issue of multiple threads writing to the same memory location simultaneously needs to be addressed, and in such cases, an std::atomic array can be used to replace a regular int array.
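
A minimal illustration of that pattern (hypothetical variable names, not the actual CoMD code):

#include <atomic>
#include <cstddef>

// Hypothetical illustration: several concurrently executed chunks may increment
// the same counter, so a plain int array is replaced by an array of std::atomic<int>.
constexpr std::size_t kNumCells = 1024;
std::atomic<int> atoms_per_cell[kNumCells];  // was: int atoms_per_cell[kNumCells];

void add_atom_to_cell(std::size_t cell) {
  // fetch_add is safe even when two chunks target the same cell concurrently.
  atoms_per_cell[cell].fetch_add(1, std::memory_order_relaxed);
}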

# 4. Runtime System

The Pure runtime system is a multithreaded, distributed runtime packaged as a dynamic library that supports Pure applications. Developers include the pure.h header, compile with the C++17 standard, and link against the libpure library. The runtime can automatically find and exploit opportunities to overlap computation with communication, which is especially valuable when communication latency is high.

The main functions of the Pure runtime system include:

  • Initializing and configuring the necessary processes and threads, and launching the application.
  • Managing communication and collective operations between ranks within a node.
  • Managing internal memory buffers and data structures.
  • Scheduling and executing Pure Tasks, if the application defines any.

# Rank initialization and mapping

In Pure, each rank is implemented as a kernel thread of an MPI process. In multi-node runs, Pure uses MPI underneath to handle cross-node communication; in single-node runs, MPI is not used at all. Either way, Pure applications never call MPI functions directly. Through Makefile configuration, a Pure program launches one MPI process per node (or per NUMA domain) and creates a number of threads matching the cores in that node or NUMA domain. Application developers only need to think in terms of the flat, non-hierarchical rank namespace; underlying details such as nodes, threads, MPI processes, and communication paths are abstracted away.

Pure supports flexible rank-to-node mapping strategies and defaults to using an SMP-style allocation strategy. Additionally, Pure supports custom rank mapping, including the use of CrayPAT rank reordering files. While these hardware-related details are invisible to developers, Pure internally utilizes this information to optimize key functionalities.

When a Pure application starts, the application's original main function is not executed directly. Instead, the underlying MPI program calls the main function defined in the Pure runtime system, which initializes Pure's core data structures and then creates and pins threads, each of which executes original_main, a renamed version of the application's original main. When the application finishes, original_main returns to the Pure runtime, which performs MPI cleanup and termination.
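
Conceptually, the startup path looks roughly like the following (an illustrative approximation of the flow described above, not Pure's actual code; everything except the original_main name is an assumption):

// Illustrative sketch of the Pure bootstrap flow (not the real libpure code).
#include <thread>
#include <vector>

// The application's main(), renamed by the build system.
int original_main(int argc, char** argv) {
  /* application code runs here, once per rank */
  return 0;
}

// main() is supplied by the Pure runtime (libpure), not by the application.
int main(int argc, char** argv) {
  // (1) Initialize MPI and Pure's core data structures here.
  const int threads_per_process = 16;  // e.g., cores per node or per NUMA domain
  std::vector<std::thread> ranks;
  for (int t = 0; t < threads_per_process; ++t) {
    ranks.emplace_back([&argc, &argv, t] {
      // (2) Pin this thread to core t (e.g., via pthread_setaffinity_np),
      //     set up its Pure rank state, then run the application as this rank.
      (void)t;  // pinning omitted in this sketch
      original_main(argc, argv);
    });
  }
  for (auto& r : ranks) r.join();
  // (3) MPI cleanup and termination once all ranks return.
  return 0;
}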

# Spin-Steal-Wait Loop (SSW-Loop)

When a Pure rank hits a blocking event, such as waiting for a message to arrive, it does not simply idle; instead it runs a Spin-Steal-Wait Loop (SSW-Loop). In this loop, the rank repeatedly checks whether its blocking condition has been satisfied (for example, whether the message has arrived) and, if not, attempts to steal work from other ranks: if the blocked rank can help with tasks that other threads in its process are concurrently executing, it pitches in.

Since threads are pinned to specific CPUs and each core runs only one application rank, Pure has ranks spin-wait actively rather than relinquish the CPU. The SSW-Loop makes ranks versatile during computation: they can run the main program's own work, help other ranks by executing stolen chunks, and then return to checking their own blocking condition.

Pure gives priority to the current rank's own workload before stealing, following a work-first scheduling principle.

Unlike systems that rely on auxiliary helper threads for work stealing or communication, Pure lets the application-level ranks themselves perform the stealing.
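
In C++-style pseudocode, the SSW-Loop described above might look like the following conceptual sketch (based only on the paper's description, not Pure's actual implementation; the two callables are assumptions):

#include <functional>

// Conceptual sketch of a Spin-Steal-Wait loop (illustrative only).
// blocked()   -> true while the rank's own blocking condition (e.g., message
//                not yet arrived) still holds.
// steal_one() -> attempts to execute one chunk of another rank's active task;
//                returns true if it did any work.
void ssw_loop(const std::function<bool()>& blocked,
              const std::function<bool()>& steal_one) {
  while (blocked()) {   // spin: re-check the blocking condition
    steal_one();        // steal: help another rank if there is work to take
                        // wait: otherwise just keep spinning; each rank is
                        // pinned to its own core, so the CPU is not yielded
  }
}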

# Implementation Instructions

Pure is written using the C++17 standard library. The Pure runtime system consists of about 21,000 lines of source code, and the Pure tools contain about 14,000 lines of source code. Pure has been tested in various environments, including laptops and clusters, and only requires a C++17 supporting compiler, a Unix-like operating system, and an MPI environment to run. The source code for Pure is publicly available on GitHub at the following link: https://github.com/psota/pure .

# Point-to-point communication

Pure provides blocking and non-blocking point-to-point message passing functionality, consistent with the message passing semantics of MPI.

Pure internally uses three different strategies for message passing, and the choice of strategy depends on the size of the message and whether the sender and receiver are located on the same node.

Pure allocates and reuses a persistent Channel object throughout the program’s lifecycle, which is stored in the runtime system. The internal Channel Manager is responsible for mapping message parameters to the appropriate data structures and creating these structures as needed.


Pure Message Passing Strategy

  • Short messages (smaller than 8KB):
    • Use a lock-free circular queue (the PureBufferQueue, PBQ) with acquire-release memory semantics. The sending thread copies the message into the PBQ when space is available, and the receiving thread copies it out once the message is ready (a simplified sketch of such a queue follows this list).
      • For short messages the copy overhead is small, and the sender can return from the call immediately and do other useful work.
    • Both sending and receiving threads wait using the SSW-Loop, overlapping computation and communication as much as possible.
    • All message slots live in one contiguous buffer, and pointer arithmetic keeps each slot aligned to cache-line boundaries, avoiding false sharing between sending and receiving threads.
  • Large messages (8KB or larger):
    • A strategy similar to the PBQ, but using a single memory copy directly from the sender's buffer to the receiver's buffer, inspired by MPI's rendezvous protocol.
    • A lock-free, fixed-size circular buffer stores the receiver's receive-call parameters.
    • The sender waits (via the SSW-Loop) for a metadata queue entry, then copies the message payload directly into the receiver's buffer. It notifies the receiver that the transfer is complete by enqueuing the number of bytes transferred into the lock-free queue.
  • Cross-node messages:
    • Transparently use MPI for the actual transfer.
    • During Pure initialization, a distributed consensus step builds a thread-rank-process-node mapping that translates Pure ranks to MPI ranks.
    • To route each message to the correct receiving thread on the destination node, the sending and receiving thread IDs are encoded into MPI_TAG, solving the multithreaded routing problem.
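
The sketch below shows a minimal single-producer/single-consumer lock-free ring buffer with acquire-release semantics, in the spirit of the PBQ described above. It is an illustrative, generic structure, not Pure's PureBufferQueue, which additionally handles variable-size payloads and cache-line-aligned slots.

#include <array>
#include <atomic>
#include <cstddef>

template <typename T, std::size_t Capacity>
class SpscRing {
 public:
  bool try_push(const T& msg) {                     // called by the sending thread only
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire)) return false;  // queue full
    slots_[head] = msg;                             // copy payload into the slot
    head_.store(next, std::memory_order_release);   // publish the slot to the receiver
    return true;
  }

  bool try_pop(T& out) {                            // called by the receiving thread only
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return false;  // queue empty
    out = slots_[tail];                             // copy payload out of the slot
    tail_.store((tail + 1) % Capacity, std::memory_order_release);    // free the slot
    return true;
  }

 private:
  std::array<T, Capacity> slots_{};                 // Pure additionally aligns slots to
                                                    // cache lines to avoid false sharing
  std::atomic<std::size_t> head_{0};                // next slot the sender will write
  std::atomic<std::size_t> tail_{0};                // next slot the receiver will read
};

A blocking send or receive can then be built by retrying try_push or try_pop inside the SSW-Loop until it succeeds.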

# Collective communication

The collective communication operations of Pure are semantically identical to MPI's, but within a node they are implemented from the ground up with custom data structures. This yields significant performance improvements in both single-node and multi-node benchmarks, even though cross-node communication still relies on MPI's collective operations.

Pure uses a leader thread to coordinate the collective communication process, while other threads assist with computation and call MPI collective functions as needed.

  • Pure uses a static leader election method, which is more efficient than the compare-and-swap based “first-come” method.

The following walks through All-Reduce as an example; the other collective operations follow similar ideas.

For small data All-Reduce operations, Pure designed a concurrent data structure called Sequenced Per-Thread Dropbox (SPTD), providing an efficient lock-free mechanism for pairwise synchronization and optionally sharing data between leader threads and other non-leader threads.


Sequenced Per-Thread Dropbox (SPTD)

This approach draws on the flat-combining technique, using thread 0 of the communicator as the leader thread.

  • For small arrays (up to 2KB):
    • Non-leader threads first copy their data into the SPTD and then synchronize with the leader thread to indicate that their input is ready, using per-thread atomic sequence numbers rather than a shared atomic counter (a sketch of this sequence-number handshake follows the list).
    • The leader thread performs an element-wise Reduce over all input arrays.
    • Each node's leader thread then calls MPI_Allreduce to perform a global Reduce on the node-local results.
    • After the leader signals completion, non-leader threads copy the final Reduce result into their private buffers.
    • All threads wait using the SSW-Loop.
  • For large arrays (over 2KB), the Reduce computation itself can become the bottleneck, so all threads perform the reduction concurrently, reading from and writing to each thread's buffer directly through shared memory.
    • The Reduce work is divided into equally sized blocks to avoid false sharing and enable vectorized computation.
    • Threads announce readiness through the SPTD and mark completion of their portion with atomic sequence numbers.
    • The leader thread calls MPI_Allreduce to perform the All-Reduce across nodes and publishes the final result via another atomic sequence number.
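
The per-thread sequence-number handshake mentioned above can be sketched as follows. This is an illustrative approximation of the idea, not Pure's SPTD implementation; the struct and function names are assumptions.

#include <atomic>
#include <cstdint>

// Illustrative sketch: each non-leader thread owns one cache-line-aligned slot
// containing a data pointer and an atomic sequence number. Publishing a new
// sequence number tells the leader "my input for this round is ready"; the
// leader bumps a separate sequence number to announce the combined result.
struct alignas(64) SptdSlot {
  const double*         data = nullptr;   // this thread's contribution
  std::atomic<uint64_t> ready_seq{0};     // incremented once per collective round
};

// Non-leader thread, collective round `round`:
inline void publish_input(SptdSlot& my_slot, const double* input, uint64_t round) {
  my_slot.data = input;
  my_slot.ready_seq.store(round, std::memory_order_release);  // makes `data` visible
}

// Leader thread waits for one contributor's input for round `round`
// (in Pure, this wait would happen inside the SSW-Loop):
inline const double* wait_for_input(const SptdSlot& slot, uint64_t round) {
  while (slot.ready_seq.load(std::memory_order_acquire) < round) { /* spin or steal */ }
  return slot.data;
}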

# Task Scheduler

The Pure runtime system includes a carefully designed task scheduler that maintains an array named active_tasks in shared memory. The array holds one atomic pointer per rank within the node, each pointing to that rank's currently executing task. Entries are initially nullptr, indicating that no task has yet been assigned.

When a task is created and prepared for execution, the system initializes its state and updates the corresponding entry in the active_tasks array through an atomic operation to reflect that the task has been assigned. This update process ensures that the execution state of the task is visible to all threads in the system, allowing the task to be “stolen” by other threads.

During the execution of a task, the rank that owns it begins executing a series of chunks, the subdivided work units of the task. Meanwhile, other threads continuously check the active_tasks array during their SSW-Loop, using atomic loads to look for non-null tasks they can execute.

The execution of the task is coordinated by two atomic integers, curr_chunk and chunks_done. The owner rank of the task and the possible thief ranks will both run the same concurrent execution function. The thief threads will return after executing one chunk, while the owner thread will continue executing until all chunks are completed. By using the fetch_add operation, a thread can determine which chunk it should execute. If the value of curr_chunk has already exceeded the total number of chunks, the thread will stop executing.

Each time a chunk is successfully completed, the thread atomically increments the value of chunks_done. The owner thread updates its local storage to avoid cache misses. Finally, the owner rank will wait until all chunks are executed, ensuring the complete execution of the task.
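
A conceptual sketch of this chunk-claiming scheme follows (illustrative only; Pure's actual scheduler also handles task registration, NUMA-aware stealing, and other details):

#include <atomic>
#include <functional>

// Shared per-task state, as described above: the next chunk to hand out and
// the number of chunks finished so far.
struct TaskState {
  std::atomic<int> curr_chunk{0};
  std::atomic<int> chunks_done{0};
  int              total_chunks = 0;
  std::function<void(int /*chunk*/)> body;
};

// Run by both the owner rank and thief ranks: a thief returns after executing
// one chunk, while the owner keeps claiming chunks until none remain and then
// waits for stragglers.
inline void execute_chunks(TaskState& t, bool is_owner) {
  for (;;) {
    const int c = t.curr_chunk.fetch_add(1, std::memory_order_relaxed);
    if (c >= t.total_chunks) break;            // nothing left to claim
    t.body(c);                                 // run the claimed chunk
    t.chunks_done.fetch_add(1, std::memory_order_release);
    if (!is_owner) return;                     // a thief executes at most one chunk here
  }
  if (is_owner) {
    // The owner cannot return until every chunk has finished, including
    // chunks stolen by other ranks.
    while (t.chunks_done.load(std::memory_order_acquire) < t.total_chunks) {
      /* in Pure, this wait happens inside the SSW-Loop */
    }
  }
}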

It is worth noting that the task’s chunk and the application’s rank are executed on the same hardware thread. In Pure applications, each hardware thread is assigned to a specific rank. Although currently Pure does not utilize hardware accelerators (such as GPUs) to accelerate task execution, the designers believe that Pure’s architecture is fully capable of supporting such acceleration.

The Pure task scheduler provides various execution modes and stealing algorithms to accommodate different execution needs. For example, the author implemented a single chunk execution mode and a guided self-scheduling mode, the latter being a work partitioning algorithm that prioritizes the allocation of larger work chunks followed by smaller ones. Additionally, the scheduler includes a NUMA-aware stealing mode, which prioritizes stealing tasks from threads on the same NUMA node, and a “sticky” stealing mode, allowing thief threads to return to tasks they recently stole that are still active. These features collectively ensure the efficiency and flexibility of task scheduling.
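
For reference, classic guided self-scheduling hands out progressively smaller batches, roughly as in this sketch (a generic formulation of the textbook algorithm, not Pure's exact policy):

#include <algorithm>

// Classic guided self-scheduling: each request receives about 1/P of the chunks
// that are still unassigned, so early batches are large and later batches shrink,
// which helps balance uneven chunk costs.
inline int next_batch_size(int chunks_remaining, int num_workers, int min_batch = 1) {
  const int guided = (chunks_remaining + num_workers - 1) / num_workers;  // ceiling
  return std::max(min_batch, guided);
}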

# Evaluation

The performance evaluation of Pure was conducted on the Cori HPC cluster at NERSC (Berkeley). The partition used consists of 2388 nodes, each with two 16-core sockets and 128GB of memory, interconnected via the Cray Aries network. The experiments enabled hyper-threading, used a 256-bit vector width, and ran two processes per node with 32 threads in total. The evaluation used the Intel compiler, with Cray MPICH as the MPI performance baseline.

# NAS DT benchmark results

  • By optimizing the message-passing mechanism alone, Pure achieved a performance boost of 11% to 25%.
  • After introducing Pure Tasks, the performance acceleration ratio increased to 1.7 times to 2.6 times.
  • Auxiliary threads can slightly improve performance, but only when unused CPU cores are available. Here, the CPU cores were fully utilized in all cases except the 80-rank configuration, where 24 cores were idle.

# CoMD and miniAMR benchmarks


  • In the CoMD molecular dynamics application, Pure outperforms both MPI-only and MPI+OpenMP configurations at every rank count, achieving speedups of 7% to 25% and 35% to 50% respectively, even in the absence of load imbalance.

  • In the miniAMR adaptive mesh refinement application, Pure achieves speedups of between 20% and 50%.

# Collective communication performance

  • Pure also performs strongly in collective communication operations: its internal optimizations and data-structure design give it a clear efficiency advantage in large-scale parallel workloads.

# Related Work

| Category | Related Work | Advantages of Pure |
| --- | --- | --- |
| MPI | 1. Uses shared memory within multi-core nodes to improve performance; 2. XPMEM significantly improves intra-node communication efficiency; 3. The ch4 network library optimizes MPI shared-memory communication; 4. Improved MPI collective communication algorithms; 5. The DMAPP library is optimized for specific collectives but has many limitations; 6. Work addressing large-scale all-to-all collectives; 7. One-sided message APIs decouple communication from synchronization; 8. Optimized data movement and process synchronization | 1. Pure performs well across all collectives and payload sizes; 2. It provides communication-computation overlap beyond what traditional one-sided APIs offer |
| MPI multithreading | 1. MPI_THREAD_MULTIPLE allows multiple threads per rank; 2. Most MPI implementations ensure thread safety with global locks, creating performance bottlenecks; 3. MPI 4.0 improves multithreading support for MPI+X; 4. The Finepoints and Endpoints proposals add thread support | 1. Pure makes MPI-style calls natural in multithreaded code; 2. It offers a single, unified programming model for introducing parallel tasks |
| AMPI | 1. An MPI-compatible library built on Charm++; 2. Provides higher-level parallel programming abstractions; 3. Achieves performance gains with minimal code changes | 1. Pure outperforms AMPI in practice thanks to optimized message passing and collectives and a lighter-weight load balancing strategy; 2. Compared with AMPI's SMP thread-based model, Pure's parallel execution is more efficient |
| PGAS languages and parallel frameworks | 1. PGAS languages provide a global memory address space abstraction; 2. Chapel and X10 extend the PGAS approach with local and remote asynchronous tasks; 3. HPX adds distributed operation support to modern C++; 4. Legion is a data-centric parallel programming system; 5. Frameworks such as Kokkos, STAPL, and BCL provide an abstraction layer between applications and hardware | 1. Like Pure, PGAS models use the SPMD style and exploit locality for performance; 2. These frameworks use modern C++ features but usually require extensive rewriting of existing applications, whereas Pure offers a more direct migration path |

# Summary

For decades, message passing has been regarded as the standard model for parallel programming due to its relative simplicity and performance advantages. However, this paper demonstrates that message passing and shared memory are not incompatible. In fact, by designing appropriate libraries, shared memory can be fully utilized without sacrificing most of the advantages of message passing.
