
# Introduction to OpenMP

# Introduction

# What is OpenMP?

OpenMP (Open Multi-Processing) is a widely used multithreading parallel programming model that provides a rich set of directives and APIs for parallel computation on shared memory systems. Originating in 1997, OpenMP was standardized by multiple leading hardware and software vendors, aiming to simplify the design and implementation of parallel programs so that they fully utilize the computational power of modern multi-core processors. This article introduces the basics of OpenMP and common programming techniques.

OpenMP supports multiple programming languages, including C, C++, and Fortran, among others, and allows developers to easily convert serial code into efficient parallel code by inserting specific compilation directives (pragma) into the source code. Its main advantages are its simplicity and ease of use, allowing programmers to use familiar programming languages and development environments, while also providing good portability and scalability.

OpenMP is managed by the non-profit OpenMP Architecture Review Board (ARB), whose members include hardware and software vendors such as Arm, IBM, Intel, AMD, NVIDIA, Cray, and Oracle.

# Historical versions

  • On the official website, you can find the historical versions and release dates of OpenMP.

| Version | Release Date |
| --- | --- |
| Fortran 1.0 | October 1997 |
| C/C++ 1.0 | October 1998 |
| C/C++ 2.0 | March 2002 |
| OpenMP 2.5 | May 2005 |
| OpenMP 3.0 | May 2008 |
| OpenMP 3.1 | July 2011 |
| OpenMP 4.0 | July 2013 |
| OpenMP 4.5 | November 2015 |
| OpenMP 5.0 | November 2018 |
| OpenMP 5.1 | November 2020 |
| OpenMP 5.2 | November 2021 |

# Basic Knowledge

# Technical framework

(Figure: OpenMP Technology Framework)

The OpenMP Runtime Library is the set of functions and runtime support structures defined in the OpenMP specification and is a key component of the OpenMP parallel programming framework. The compiler links this library into user programs, and at run time it manages thread creation, synchronization, scheduling, and data sharing, implementing the parallelization mechanisms requested by OpenMP compiler directives.

OpenMP Runtime Library includes the following features:

  • Thread management (creation, destruction, synchronization)
  • Work sharing (dynamic distribution of work to threads)
  • Task scheduling
  • Synchronization primitives (such as barriers, locks, atomic operations)
  • Dynamic adjustment of the thread count
  • Memory model support (data environment: private, shared, reduction variables, etc.)

Compiler Directives: preprocessor instructions starting with #pragma omp that programmers insert into the source code to tell the compiler how to convert a serial program into a parallel one. For example, the #pragma omp parallel directive defines a parallel region, and the compiler generates multithreaded execution logic for that region.

Environment Variables: variables read by the OpenMP runtime that control runtime behavior, such as the number of threads and the scheduling policy.

The OpenMP library routines are a set of functions covering thread management, synchronization, atomic operations, locks, timing, and so on. User programs can call them directly to achieve finer-grained control over parallel execution.

Overall, the OpenMP technology framework includes multiple components such as compiler directives, runtime libraries, environment variables, and function libraries. Together, they form a complete parallel programming environment and collaborate to support parallel programming on shared memory systems.

# Execution Model: the Fork-Join Model

OpenMP’s execution model uses the Fork-Join mechanism, which is a synchronization primitive model used in parallel programming. Under this model, program execution follows these steps:

  1. Fork Phase: The program starts executing as a single main thread. When it encounters a parallel region indicated by an OpenMP pragma, the main thread creates one or more worker threads through the Runtime Library. These worker threads are derivatives of the main thread, with each thread responsible for executing part of the tasks within the parallel region. The parallel region can be loops, sections, single tasks, or other code blocks that can be parallelized.

  2. Parallel Execution Phase: The created worker threads independently and concurrently execute the tasks assigned to them and can access shared data structures. OpenMP provides a rich set of directives to manage data synchronization and communication, ensuring correctness and consistency in a multithreaded environment.

  3. Join (Merge) Phase: When all worker threads have completed their tasks within the parallel region, they automatically or through explicit synchronization directives (such as omp barrier) converge at the join point. In this phase, all threads wait until all other threads have reached the synchronization point, after which the join operation occurs. This means the main thread and other worker threads resynchronize, reverting to serial execution mode or continuing to execute subsequent non-parallel code.

  4. Synchronization and Data Consistency: The Fork-Join model ensures mutual exclusion access to shared resources and data consistency during parallel execution through appropriate locking mechanisms, atomic operations, and synchronization primitives.

In summary, the Fork-Join execution model of OpenMP is a parallel processing framework based on dynamic thread creation and synchronization. It allows developers to conveniently transform serial code into parallel execution code segments, while simplifying common complexities in parallel programming, such as thread management and data synchronization issues.
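As a minimal sketch of the Fork-Join flow (the value 4 in num_threads(4) is just an illustrative choice), the program below runs serially, forks a team of threads for the parallel region, and joins back to a single thread afterwards:

#include <omp.h>
#include <stdio.h>

int main() {
  printf("Serial part: only the main thread runs here\n");

  // Fork: the main thread creates a team of worker threads
  #pragma omp parallel num_threads(4)
  {
    printf("Parallel region: thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }  // Join: implicit barrier, then only the main thread continues

  printf("Serial part again: back to a single thread\n");
  return 0;
}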

# Thread and Process

  • Process
    • Each process has its own independent address space.
    • The CPU must perform a full context switch when switching between processes.
  • Thread
    • Threads within a process share the same address space.
    • Switching between threads has lower overhead for the CPU.
  • Thread design of the operating system
    • Modern operating systems such as Linux and Windows support multiple threads within a single process.
    • A thread is the basic unit of scheduling, while a process is the basic unit of resource allocation.

(Figure: Thread Design of Operating Systems)

# Hardware scheduling of threads

  • The hardware scheduling mechanism collaborates with the operating system to intelligently map threads to available CPU physical cores for execution.
  • Therefore, in multithreaded applications, when the number of active threads exceeds the actual number of physical CPU cores, the operating system will have to perform intensive context switching to ensure that multiple threads alternate on limited core resources. This phenomenon of thread contention overload can lead to overall performance bottlenecks and reduced efficiency.
  • Hyper-Threading Technology exposes additional logical processing units on a single physical CPU core, typically two logical cores per physical core. These logical cores can execute independent instruction streams, although they share the physical core's underlying resources, such as execution engines and caches. Hyper-threading therefore aims to improve resource utilization and concurrency, and it can noticeably raise overall throughput when there are many parallel tasks with relatively modest compute demands. However, for workloads that depend heavily on single-core performance or memory bandwidth, such as some CPU-bound games or data-intensive operations, the extra logical cores may bring little improvement.

# Hardware memory model

  • In modern multi-core processor architectures, each CPU core has a multi-level cache hierarchy between it and main memory to speed up data access. Closest to the core is the L1 cache, usually followed by the L2 cache, and many architectures also include an L3 cache. Moving away from the core, these cache levels grow in capacity but also in access latency.
  • L1 and L2 caches are usually closely coupled and private to specific CPU cores, meaning each core has its own independent cache space to reduce data access conflicts and improve cache hit rate. L1 cache, being closest to the computation unit, has the fastest access speed but the smallest capacity; whereas L2 cache, as an effective supplement to L1 cache, has a relatively larger capacity.
  • To ensure consistency of shared data in the caches of different CPU cores in a multi-core environment, hardware and the operating system jointly implement a cache coherence protocol (such as the MESI protocol). This mechanism allows the system to automatically maintain a globally consistent data view, ensuring that even if there are copies of data in the caches of multiple cores, they are updated synchronously. This feature is referred to as ccNUMA (cache-coherent non-uniform memory access) in some architectures.
  • However, this cache consistency management also brings some challenges, one of which is the “False Sharing” problem. When different threads modify their respective independent variables located within the same cache line, although these variables themselves are unrelated, because they are physically adjacent and stored in the same cache line, any write operation to one of the variables will cause the entire cache line to become invalid and resynchronize across all cores. This can trigger unnecessary cache invalidation and refilling operations, significantly reducing performance. Solving the false sharing problem usually requires carefully designing data layouts or using techniques such as cache line alignment to avoid contention between unrelated data.
(Figure: Typical Modern CPU Memory Structure)

# Thread Affinity and Thread Binding

  • Thread Affinity refers to the ability of the operating system or application to control the association between specific threads and processor cores. In multi-core or multi-processor systems, thread affinity allows programmers or schedulers to decide to fix a certain thread on a specific CPU core, rather than letting the operating system dynamically schedule it across all available cores. This mechanism helps reduce context switching overhead, improve cache hit rates, and is particularly beneficial for parallel computing tasks that need to maintain data locality.
  • Thread Pinning is a specific technical means to achieve thread affinity, which specifies the forced association between a specific thread and specific hardware resources (such as CPU cores or NUMA nodes). Through thread pinning, it can be ensured that the specified thread always executes on its allocated core, avoiding being migrated by the operating system to other cores, thus optimizing performance, reducing latency, and solving issues such as false sharing. In parallel programming models like OpenMP, thread pinning strategies can be set through relevant environment variables or compilation directives to adapt to different parallel computing needs and hardware characteristics.
  • The access latency of CPUs on the same socket to the L3 cache is consistent, but the access latency of CPUs on different sockets to the L3 cache is inconsistent. Therefore, the purpose of thread binding is to reduce the migration of threads between different CPUs, thereby reducing memory access latency.
(Figure: Thread Affinity and Thread Binding)

  • OpenMP supports controlling the binding of threads
    • Environment variable OMP_PROC_BIND or clause proc_bind(master|close|spread) controls whether threads are bound and the distribution of threads to binding units (referred to as places)
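As an illustrative sketch (the specific values are only examples), thread binding can be controlled from the shell before launching an OpenMP program:

$ export OMP_NUM_THREADS=8
$ export OMP_PROC_BIND=close   # bind threads and place them on adjacent places
$ export OMP_PLACES=cores      # one place per physical core
$ ./hello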

# OpenMP Programming

# Install

For Linux systems, GCC is a commonly used compiler, and modern versions of GCC generally support OpenMP by default. For example, on Ubuntu 20.04 LTS, you can install the build-essential package with OpenMP support using the following command:

$ sudo apt-get update
$ sudo apt-get install -y build-essential
  • Check the OpenMP version supported by the compiler
$ echo |cpp -fopenmp -dM |grep -i open
#define _OPENMP 201511
  • The _OPENMP macro encodes the release date of the supported specification; 201511 corresponds to OpenMP 4.5.

# Compilation and Usage

  • Simply add the -fopenmp option in the compilation statement to enable OpenMP support.
g++ -O2 -std=c++17 -fopenmp hello.cpp -o hello
  • If the project is built with CMake, link against the imported OpenMP::OpenMP_CXX target; the -Wunknown-pragmas option additionally makes the compiler warn about #pragma directives it does not handle (for example, when OpenMP support is not enabled).
find_package(OpenMP REQUIRED)
add_compile_options(-Wunknown-pragmas)

add_executable(hello hello.cpp)
target_link_libraries(hello PRIVATE OpenMP::OpenMP_CXX)

# Hello World!

  • The first OpenMP program
#include <omp.h>
#include <stdio.h>

int main() {
  #pragma omp parallel num_threads(8)
  {
    int id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    printf("Hello World from thread %d of %d \n", id, num_threads);
  }
  return 0;
}
  • Execution result: eight lines of the form "Hello World from thread X of 8" are printed, one per thread; the order varies between runs because the threads execute concurrently.
  • An OpenMP directive together with the structured block it applies to is called a construct.
  • The general format is #pragma omp <directive-name> [clause [, clause] ...]
  • The code block enclosed in {} after the parallel directive is the parallel region.

# Setting the Number of Threads

  • Priority, from lowest to highest (see the sketch after this list):
    • Do nothing; the implementation chooses the number of threads.
    • Set the environment variable, e.g. export OMP_NUM_THREADS=4
    • Call the library function void omp_set_num_threads(int) in code.
    • Use the num_threads(4) clause on the directive.
    • An if clause ultimately decides whether the region runs in parallel or serially.
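A minimal sketch of this precedence (the values 4 and 2 are arbitrary): the num_threads clause overrides the earlier omp_set_num_threads call, which in turn overrides OMP_NUM_THREADS from the environment.

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_num_threads(4);              // overrides OMP_NUM_THREADS from the environment

  #pragma omp parallel
  {
    #pragma omp single
    printf("without clause: %d threads\n", omp_get_num_threads());       // expect 4
  }

  #pragma omp parallel num_threads(2)  // the clause overrides omp_set_num_threads
  {
    #pragma omp single
    printf("with num_threads(2): %d threads\n", omp_get_num_threads());  // expect 2
  }
  return 0;
}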

# Common Library Functions

  • Set the number of threads for parallel region execution: void omp_set_num_threads(int)
  • Get the number of threads in the parallel region: int omp_get_num_threads()
  • Get the current thread number: int omp_get_thread_num()
  • Get OpenMP Wall Clock time (unit: seconds): double omp_get_wtime()
  • Get time precision: double omp_get_wtick()

# Parallel Construct

Supported Clauses

  • if(scalar_expression): if scalar_expression is true, execute in parallel; otherwise execute serially (see the sketch after this list).
  • num_threads(integer_expression): specifies the number of threads in the parallel region.
  • default(shared|none): specifies the default data-sharing attribute of variables.
    • shared: all variables are shared by default.
    • none: no default; every variable used in the region must be explicitly declared shared or private.
  • shared(list): specifies the list of shared variables.
    • A shared variable has a single copy in memory that all threads access.
    • Make sure concurrent accesses to shared variables do not conflict.
    • Unless specified otherwise, variables in a parallel region are shared by default.
  • private(list): specifies the list of private variables.
    • Each thread gets its own independent copy of a private variable.
    • The private copies are uninitialized and must be initialized inside the region.
  • firstprivate(list): specifies the list of firstprivate variables.
    • Same as private, except that
    • each thread's copy is initialized with the value the variable had in the master thread before the region.
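A minimal sketch combining the if and default(none) clauses; the values of n and the threshold are arbitrary choices for illustration:

#include <omp.h>
#include <stdio.h>

int main() {
  int n = 8;
  int results[8] = {0};

  // Parallelize only when n is large enough; with default(none),
  // every variable used inside the region must be listed explicitly.
  #pragma omp parallel num_threads(4) if(n > 4) default(none) shared(n, results)
  {
    int tid = omp_get_thread_num();   // tid is private: declared inside the region
    results[tid] = tid * n;
  }

  for (int i = 0; i < 8; i++) printf("%d ", results[i]);
  printf("\n");
  return 0;
}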

Note

Example 1: no clause, private, firstprivate

#include <omp.h>
#include <stdio.h>

int main() {
    int results[4];
    int cnt;
    cnt = 1;

    // No data-sharing clause: cnt is shared, so all threads update the same copy (data race).
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        for (int i = 0; i < 4; i++) {
            cnt += 1;
        }
        results[tid] = cnt;
    }

    printf("no clause: ");
    for (int i = 0; i < 4; i++) {
        printf("%d ", results[i]);
    }
    printf("\n");

    cnt = 1;

    // private(cnt): each thread gets its own uninitialized copy of cnt.
    #pragma omp parallel num_threads(4) private(cnt)
    {
        int tid = omp_get_thread_num();
        for (int i = 0; i < 4; i++) {
            cnt += 1;
        }
        results[tid] = cnt;
    }

    printf("private(not init): ");
    for (int i = 0; i < 4; i++) {
        printf("%d ", results[i]);
    }
    printf("\n");

    cnt = 1;

    // firstprivate(cnt): each thread's copy is initialized from the master thread's cnt (= 1).
    #pragma omp parallel num_threads(4) firstprivate(cnt)
    {
        int tid = omp_get_thread_num();
        for (int i = 0; i < 4; i++) {
            cnt += 1;
        }
        results[tid] = cnt;
    }

    printf("firstprivate: ");
    for (int i = 0; i < 4; i++) {
        printf("%d ", results[i]);
    }
    printf("\n");
    return 0;
}
  • Print the result
no clause: 5 9 13 17
private(not init): 4 1572916964 1572916964 1572916964
firstprivate: 5 5 5 5

# For Construct

  • One of the most commonly used parallelization constructs

Note

Example 2: Parallelizing the for loop

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        int num_threads = omp_get_num_threads();
        // The ordered clause on the loop directive is required in order to
        // use the ordered construct inside the loop body.
        #pragma omp for ordered
        for (int i = 0; i < num_threads; i++) {
            #pragma omp ordered
            printf("Hello from thread %d of %d \n", tid, num_threads);
        }
    }
    return 0;
}
  • Print the result
Hello from thread 0 of 8
Hello from thread 1 of 8
Hello from thread 2 of 8
Hello from thread 3 of 8
Hello from thread 4 of 8
Hello from thread 5 of 8
Hello from thread 6 of 8
Hello from thread 7 of 8
  • Within a parallel region, the for directive divides the loop iterations among the threads; the loop must follow OpenMP's canonical form:
    • init-expr: must have the form var = lb, and the type of var is restricted
    • test-expr: restricted to var relational-op ub or ub relational-op var
    • incr-expr: var may only be incremented or decremented by a loop-invariant amount

# Parallel for construct

  • parallel and for are often combined into a single parallel for directive. The table below lists which clauses each directive accepts:

| Clause | parallel | for | parallel for |
| --- | --- | --- | --- |
| if | ✓ | | ✓ |
| num_threads | ✓ | | ✓ |
| default | ✓ | | ✓ |
| copyin | ✓ | | ✓ |
| private | ✓ | ✓ | ✓ |
| firstprivate | ✓ | ✓ | ✓ |
| shared | ✓ | | ✓ |
| reduction | ✓ | ✓ | ✓ |
| lastprivate | | ✓ | ✓ |
| schedule | | ✓ | ✓ |
| ordered | | ✓ | ✓ |
| collapse | | ✓ | ✓ |
| nowait | | ✓ | |
  • lastprivate(list)
    • Same as private.
    • After the loop finishes, the value from the sequentially last iteration is copied back to the master thread's variable.
  • nowait: removes the implied barrier synchronization at the end of the construct.
  • collapse(n): applies to n perfectly nested loops, merging them into one larger iteration space.
    • Pay attention to data dependencies between the loops.
  • ordered: declares that part of the loop must execute in iteration order.
    • Mark the sequential part with #pragma omp ordered (the two are used together).
    • At any given time, the code inside an ordered region is executed by at most one thread.
  • schedule(type[,chunk])
    • type: the scheduling strategy for loop iterations (see the sketch after this list).
      • static: static scheduling with a fixed chunk size (default roughly n/p)
      • dynamic: dynamic scheduling with a fixed chunk size (default 1)
      • guided: guided scheduling, the chunk size shrinks dynamically
      • runtime: taken from the environment variable OMP_SCHEDULE
      • auto: the implementation chooses
    • chunk: the number of iterations a thread obtains at a time.
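A minimal sketch of the schedule clause; the chunk size 100 and the artificial, uneven per-iteration workload are arbitrary choices for illustration. With iteration costs that grow with i, a dynamic schedule balances the load better than the default static split:

#include <omp.h>
#include <stdio.h>

int main() {
    const int n = 1000;
    double total = 0.0;

    // Iterations get more expensive as i grows, so hand out work in small
    // dynamic chunks instead of one fixed block per thread.
    #pragma omp parallel for schedule(dynamic, 100) reduction(+ : total)
    for (int i = 0; i < n; i++) {
        double x = 0.0;
        for (int j = 0; j < i; j++) {    // uneven, artificial workload
            x += 1.0 / (j + 1.0);
        }
        total += x;
    }

    printf("total = %f\n", total);
    return 0;
}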

# Special data clause: Reduction

In OpenMP, reduction is a parallel programming technique used to address data race issues in a multithreaded environment, especially when performing accumulation or similar operations on global variables. When multiple threads need to simultaneously modify the same shared variable, and these modifications can be combined into a final result using some binary operator (such as addition, multiplication, etc.), the reduction clause can be used.

Specifically, the execution process of reduction is:

  • Fork threads and distribute the work.
  • Each thread gets a private copy of the reduction variable (called omp_priv in the specification), initialized to the identity value of the operator (e.g. 0 for +).
  • Each thread performs its computation on its private copy.
  • The private partial results are combined with the reduction operator (the combiner operates on omp_in and omp_out) and the final value is written back to the original variable. A sketch of a user-defined reduction follows this list.
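The names omp_priv, omp_in, and omp_out come from OpenMP's declare reduction mechanism for user-defined reductions. A minimal sketch, assuming a hypothetical vec2 type and a component-wise sum named vsum:

#include <omp.h>
#include <stdio.h>

// Hypothetical 2D vector type used to illustrate a user-defined reduction.
typedef struct { double x, y; } vec2;

vec2 vec2_zero(void) { vec2 z = {0.0, 0.0}; return z; }

void vec2_add(vec2 *out, const vec2 *in) {
    out->x += in->x;
    out->y += in->y;
}

// combiner: how two partial results are merged (omp_out absorbs omp_in)
// initializer: the starting value of each thread's private copy omp_priv
#pragma omp declare reduction(vsum : vec2 : vec2_add(&omp_out, &omp_in)) \
    initializer(omp_priv = vec2_zero())

int main() {
    vec2 acc = {0.0, 0.0};

    #pragma omp parallel for reduction(vsum : acc)
    for (int i = 0; i < 1000; i++) {
        acc.x += 1.0;   // each thread actually updates its private copy (omp_priv)
        acc.y += 2.0;
    }

    printf("acc = (%.1f, %.1f)\n", acc.x, acc.y);  // expect (1000.0, 2000.0)
    return 0;
}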

In contrast, atomic is another synchronization mechanism provided by OpenMP, which ensures that access to a single memory location is atomic in a multithreaded environment, meaning that only one thread is allowed to read or write to that memory location at a time. By using the #pragma omp atomic directive, it can be ensured that a simple assignment statement (or certain types of read-modify-write operations) will not encounter data races in a concurrent environment.

Note

Example 3: Reduction

#include <omp.h>
#include <stdio.h>

int main() {
    int sum = 0;
    double start = omp_get_wtime();
    // reduction: each thread accumulates into a private copy, combined at the end
    #pragma omp parallel for num_threads(8) reduction(+ : sum)
    for (int i = 0; i < 100000; i++) {
        sum += i;
    }
    // note: the true sum 4999950000 overflows int, which is why the printed value wraps around
    printf("sum = %d\n", sum);
    printf("Reduction time: %.5lf s\n", omp_get_wtime() - start);

    // no reduction: every update to the shared sum is protected by an atomic operation
    sum = 0;
    start = omp_get_wtime();
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < 100000; i++) {
        #pragma omp atomic
        sum += i;
    }
    printf("sum = %d\n", sum);
    printf("Atomic time: %.5lf s\n", omp_get_wtime() - start);
    return 0;
}
  • Print the result
sum = 704982704
Reduction time: 0.00062 s
sum = 704982704
Atomic time: 0.01021 s
  • Both versions produce the same result, but the reduction version runs much faster. With reduction, each thread accumulates into its own private copy and never contends for the shared variable; the partial results are merged efficiently only at the end. With atomic, every single update to the shared sum must be serialized.
(Figure: OpenMP reduction operation)

# Synchronization Constructs

# Sections Construct

  • Divides the code in a parallel region into multiple sections, each executed once.
  • Can be combined with parallel to form a parallel sections construct.
  • Each section is executed by one thread.
    • If there are more threads than sections, some threads are idle.
    • If there are fewer threads than sections, some threads execute more than one section.
  • Example code:
#pragma omp sections   // typically placed inside an enclosing parallel region
{
  #pragma omp section
  method1();           // executed by one thread
  #pragma omp section
  method2();           // executed by (possibly) another thread
}

# Barrier Construct

  • Perform fence synchronization at a specific location
  • In the presence of data dependencies, a barrier can be used to ensure data consistency.
(Figure: Barrier Synchronization Diagram)
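A minimal sketch of an explicit barrier, assuming two phases where every thread's second phase depends on the first phase of all threads (the array size 8 is arbitrary):

#include <omp.h>
#include <stdio.h>

int main() {
    int phase1[8] = {0};

    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        phase1[tid] = tid * tid;   // phase 1: each thread produces one value

        #pragma omp barrier        // wait until every thread has finished phase 1

        // phase 2: now it is safe to read a neighbour's phase-1 result
        int neighbour = (tid + 1) % 8;
        printf("thread %d sees neighbour value %d\n", tid, phase1[neighbour]);
    }
    return 0;
}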

# Single Construct

  • Used to mark a code block executed by only one thread, with implicit barrier synchronization, and the implicit barrier synchronization can be canceled using nowait.
(Figure: pragma omp single)

# Atomic Construct

  • Used to ensure atomic operations on shared variables, avoiding data races.
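A minimal sketch of the atomic construct protecting a single read-modify-write of a shared counter (the counter and loop bound are only illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
    int hits = 0;

    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < 10000; i++) {
        if (i % 3 == 0) {
            #pragma omp atomic     // the update of the shared counter is atomic
            hits += 1;
        }
    }

    printf("hits = %d\n", hits);   // expect 3334
    return 0;
}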

# False Sharing

  • False sharing, in simple terms, refers to multiple threads simultaneously accessing different parts of the same cache line, leading to cache line invalidation and refilling, thereby reducing the program’s performance.
  • Simultaneous read and write of the same cache line by different cores can cause serious conflicts, leading to cache invalidation.
(Figure: False Sharing Issue)

  • In OpenMP programs, the main ways to avoid false sharing are (see the sketch after this list):
    • Data structure alignment: use alignment keywords or compiler attributes to place each thread's data in its own cache line. For example, in C++ the alignas keyword specifies the memory alignment of a variable.
    • Padding: insert enough padding space between adjacent variables so that they do not end up in the same cache line.
    • Avoid unnecessary sharing: design algorithms and data structures to reduce shared data accesses; where possible, let each thread operate on its own independent data segment.
    • Custom memory allocation: use special allocation routines to align allocated regions to cache-line boundaries, so that data handed to different threads does not land on the same cache line.
      • Some platforms and compilers provide attributes for this, such as __declspec(align(#)) (MSVC) or __attribute__((aligned(#))) (GCC/Clang).
      • The problem can also be avoided indirectly by limiting variable scope or by creating per-thread private copies.
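A minimal sketch of per-thread counters aligned to an assumed 64-byte cache line, so that each thread writes only to its own line; 64 bytes is a typical but architecture-dependent value:

#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64   // assumed cache-line size; check your hardware

// Each counter gets its own cache-line-aligned slot, so two threads never
// write to the same cache line.
typedef struct {
    _Alignas(CACHE_LINE) long value;
} padded_counter;

int main() {
    padded_counter counters[8];
    for (int i = 0; i < 8; i++) counters[i].value = 0;

    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++) {
            counters[tid].value += 1;   // stays within this thread's own cache line
        }
    }

    long total = 0;
    for (int i = 0; i < 8; i++) total += counters[i].value;
    printf("total = %ld\n", total);     // expect 80000000
    return 0;
}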

# Task Construct

  • In addition to the Fork-Join model, OpenMP also supports a task-parallel model through the task directive.
  • The runtime maintains a thread pool and a task pool; idle threads dynamically pick up tasks from the task pool, so tasks execute in parallel.

Note

Example 4: Task Parallelism

#include <iostream>
#include <omp.h>
#include <unistd.h>
#include <iomanip>

void big_task(int i) {
    sleep(10);
}

void small_task(int i) {
    sleep(1);
}

int main() {
    int ntasks = 8;
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            std::cout << "Task 0 Created" << std::endl;
            #pragma omp task
            big_task(0);

            std::cout << "Task 1 Created" << std::endl;
            #pragma omp task
            big_task(1);

            for (int i = 2; i < ntasks; i++) {
                std::cout << "Task " << i << " Created" << std::endl;
                #pragma omp task
                small_task(i);
            }
        }
        #pragma omp taskwait
    }
    std::cout << "All tasks finished" << std::endl;
    std::cout << "Time: " << std::fixed << std::setprecision(2) << omp_get_wtime() - start << "s" << std::endl;
    return 0;
}
  • Running result: the single task-generating thread prints "Task 0 Created" through "Task 7 Created", and after all tasks complete the program prints "All tasks finished". On a machine with at least 8 hardware threads the reported time is roughly 10 s (the duration of the longest task), compared with about 26 s if the tasks ran serially.
  • In this code, we use the #pragma omp task directive to create tasks, and the execution of tasks is dynamically obtained and executed by threads in the thread pool. After creating tasks, we use #pragma omp taskwait to wait for all tasks to complete. This achieves an asynchronous execution effect.

# Vectorization: SIMD Construct

  • SIMD (Single Instruction, Multiple Data) is a parallel computing model that performs operations on multiple data simultaneously with a single instruction, thereby achieving efficient data parallel computation.
  • In OpenMP, the #pragma omp simd directive requests vectorized execution of a loop (a short sketch follows the list of compiler options below).
    • aligned lists pointers or arrays whose data is memory-aligned (with an optional alignment in bytes).
    • safelen(n) bounds the vector length: only iterations at most n apart may run concurrently, which matters when the loop carries dependencies.
    • linear lists variables that change by a fixed amount per iteration (a linear relationship with the loop variable).
  • Compilers such as gcc also come with vectorization capabilities, generally using the following compilation options
    • -O3
    • -ffast-math
    • -fivopts
    • -march=native
    • -fopt-info-vec
    • -fopt-info-vec-missed
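A minimal sketch of the simd construct computing a dot product; the array size and the 64-byte alignment are illustrative assumptions (compile with, e.g., gcc -O2 -fopenmp):

#include <omp.h>
#include <stdio.h>

int main() {
    const int n = 1024;
    // Request 64-byte alignment so the aligned clause below holds (assumed SIMD width).
    static _Alignas(64) double a[1024], b[1024];

    for (int i = 0; i < n; i++) {
        a[i] = i * 0.5;
        b[i] = 2.0;
    }

    double dot = 0.0;
    // Vectorize the loop; partial sums from the SIMD lanes are combined by the reduction.
    #pragma omp simd reduction(+ : dot) aligned(a, b : 64)
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
    }

    printf("dot = %f\n", dot);   // expect 0.5 * 2.0 * (0 + 1 + ... + 1023) = 523776.0
    return 0;
}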