
Arm Performance Optimization: Scalable Vector Extension SVE

This article introduces Arm's Scalable Vector Extension (SVE) and its enhanced version, SVE2. They significantly improve the performance of data-intensive applications (such as HPC and ML) by providing variable-length vector registers, flexible per-lane predication, and a rich instruction set, while binary compatibility ensures portability across different hardware implementations. Additionally, the Arm C Language Extensions (ACLE) assist developers in programming: SVE instructions can be used directly in C/C++ code by calling intrinsic functions declared in the arm_sve.h header for efficient vectorized operations.


# 1. SVE Introduction

After Neon, an architecture extension with a fixed 128-bit vector length, Arm designed the Scalable Vector Extension (SVE) as the next-generation SIMD extension for AArch64. SVE introduces the concept of scalability: an implementation may choose its vector length from a range of possible values, from a minimum of 128 bits to a maximum of 2048 bits, in increments of 128 bits. The SVE design ensures that the same application can run on different SVE-supporting implementations without recompilation. SVE enhances the architecture's suitability for high-performance computing (HPC) and machine learning (ML) applications, which process very large amounts of data. SVE2 is a superset of SVE and Neon that extends data-level parallelism to more functional domains; it inherits the concepts, vector registers, and operating principles of SVE. SVE and SVE2 define 32 scalable vector registers, and chip partners can choose an appropriate vector length for their implementation, with the hardware length varying between 128 and 2048 bits (in increments of 128 bits). The advantage of SVE and SVE2 is that a single vector instruction set serves every scalable vector length.

The SVE design philosophy allows developers to write and build software once, and then run the same binary on different AArch64 hardware with various SVE vector length implementations. The portability of the binary means developers do not need to know the vector length implementation of their system. This eliminates the need to rebuild the binary, making the software easier to port. In addition to scalable vectors, SVE and SVE2 also include:

  • per-lane predication
  • Gather Load/Scatter Store
  • Speculative Vectorization

These features help vectorize and optimize loops when dealing with large datasets.

The main difference between SVE2 and SVE lies in the functional coverage of the instruction set. SVE is specifically designed for HPC and ML applications. SVE2 extends the SVE instruction set to enable accelerated data processing in areas beyond HPC and ML. The SVE2 instruction set can also accelerate common algorithms used in the following applications:

  • Computer Vision
  • Multimedia
  • LTE baseband processing
  • Genomics
  • In-memory database
  • Web Service
  • General software

SVE and SVE2 both support collecting and processing large amounts of data. SVE and SVE2 are not extensions of the Neon instruction set; rather, they were redesigned to offer better data parallelism than Neon. However, the hardware logic of SVE and SVE2 encompasses the Neon hardware implementation: a microarchitecture that supports SVE or SVE2 also supports Neon, so software targeting such a microarchitecture can rely on Neon being present.

# 2. SVE Architecture Basics

This section introduces the basic architectural features shared by SVE and SVE2. Like SVE, SVE2 is also based on scalable vectors. In addition to the existing register file provided by Neon, SVE and SVE2 add the following registers:

  • 32 scalable vector registers, Z0-Z31
  • 16 scalable Predicate registers, P0-P15
    • 1 First Fault Predicate register, FFR
  • Scalable Vector System Control Register, ZCR_ELx

# 2.1 Scalable Vector Registers

Scalable vector registers Z0-Z31 can be implemented in microarchitecture as 128-2048 bits. The lowest 128 bits are shared with Neon’s fixed 128-bit vectors V0-V31.

The image below shows scalable vector registers Z0-Z31:


Scalable Vector Registers Z0-Z31

Scalable Vector:

  • Can accommodate 64, 32, 16, and 8-bit elements
  • Supports integer, double-precision, single-precision, and half-precision floating-point elements
  • The vector length can be configured for each exception level (EL)

# 2.2 Scalable Predicate Register

In order to control which active elements participate in operations, Predicate registers (abbreviated as P registers) are used as masks in many SVE instructions, which also provides flexibility for vector operations. The figure below shows the scalable Predicate registers P0-P15:


Scalable Predicate Registers P0-P15

The P register is typically used as a bitmask for data manipulation:

  • Each P register is 1/8 the length of a Z register
  • P0-P7 are used for loading, storing, and arithmetic operations
  • P8-P15 are used for loop management
  • FFR is a special P register set by the first-fault vector load and store instructions, used to indicate the success of load and store operations for each element. FFR is designed to support speculative memory access, making vectorization easier and safer in many cases.

# 2.3 Scalable Vector System Control Register

The figure below shows the Scalable Vector System Control Register ZCR_ELx:


Scalable Vector System Control Register ZCR_ELx

Scalable Vector System Control Register indicates SVE implementation features:

  • The ZCR_ELx.LEN field sets the vector length for the current and lower exception levels.
  • Most bits are currently reserved for future use.

# 2.4 SVE Assembly Syntax

The SVE assembly syntax format consists of an opcode, destination register, P register (if the instruction supports a Predicate mask), and input operands. The following instruction example will detail this format.

Example 1:

LDFF1D {<Zt>.D}, <Pg>/Z, [<Xn|SP>, <Zm>.D, LSL #3]

Where:

  • <Zt> is the Z register, Z0-Z31
  • <Zt>.D and <Zm>.D specify the element type of the target and operand vectors, without needing to specify the number of elements.
  • <Pg> is the P register, P0-P15
  • <Pg>/Z selects zeroing predication: inactive elements of the destination are set to zero.
  • <Zm> specifies the offset for the Gather Load address mode.

Example 2:

ADD <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>

Where:

  • <Pg>/M selects merging predication: inactive elements of the destination retain their previous values.
  • <Zdn> is both the destination register and one of the input operands. The instruction syntax shows <Zdn> in both places for convenience. In the assembly encoding, for simplification, they are only encoded once.

Example 3:

ORRS <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B

Where:

  • The S suffix indicates that the instruction sets the condition flags NZCV according to the predicate result.
  • <Pg> is the governing P register, acting as a bitmask for the operation.

# 2.5 SVE Architecture Features

SVE includes the following key architectural features:

  • per-lane predication

To allow flexible operations on selected elements, SVE introduces 16 P registers, P0-P15, to indicate which vector lanes are active. For example:

ADD Z0.D, P0/M, Z0.D, Z1.D

This adds the active elements of Z0 and Z1 and places the results in Z0. P0 indicates which elements of the operands are active or inactive. The M following P0 stands for Merging: the inactive elements of Z0 retain their previous values after the ADD operation. If Z follows P0 instead, zeroing is selected, and the inactive elements of the destination register are set to zero after the operation.


Per-lane predication merging

If /Z is used, zeroing is selected and the inactive elements of the destination register are zeroed after the operation. For example:

CPY Z0.B, P0/Z, #0xFF

This copies the signed immediate 0xFF into the active lanes of Z0, while the inactive lanes are cleared to zero.


Per-lane predication zeroing
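The merging and zeroing behaviors above can be modelled in a few lines of plain C. This is an illustrative sketch, not ACLE code: the 4-lane width, the function names, and the predicate representation are all assumptions made for the example.

```c
#include <stdint.h>

// Hypothetical 4-lane model of SVE per-lane predication; pred[i] marks
// the active lanes (plain C, not ACLE intrinsics).

// Merging (/M): inactive destination lanes keep their previous values.
void add_merge(int64_t zd[4], const int64_t za[4], const int64_t zb[4],
               const int pred[4]) {
    for (int i = 0; i < 4; i++)
        if (pred[i]) zd[i] = za[i] + zb[i];  // inactive lanes untouched
}

// Zeroing (/Z): inactive destination lanes are cleared to zero.
void add_zero(int64_t zd[4], const int64_t za[4], const int64_t zb[4],
              const int pred[4]) {
    for (int i = 0; i < 4; i++)
        zd[i] = pred[i] ? za[i] + zb[i] : 0;
}
```

With pred = {1,0,1,0}, add_merge leaves lanes 1 and 3 of the destination unchanged, while add_zero clears them.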

Note

Not all instructions have the Predicate option. Additionally, not all Predicate operations have both merge and zeroing options. You must refer to the AArch64 SVE Supplement to understand the specification details of each instruction.

  • Gather Load and Scatter Store

The addressing modes in SVE allow vectors to be used as base addresses and offsets in Gather Load and Scatter Store instructions, which enables access to non-contiguous memory locations. For example:

LD1SB Z0.S, P0/Z, [Z1.S] // Gather Load signed bytes from memory addresses generated by the 32-bit vector base address Z1 into the active 32-bit elements of Z0.

LD1SB Z0.D, P0/Z, [X0, Z1.D] // Gather Load signed bytes from memory addresses generated by the 64-bit scalar base address X0 plus the vector index in Z1.D into the active elements of Z0.

The following example shows the load operation LD1SB Z0.S, P0/Z, [Z1.S], where P0 contains all true elements, and Z1 contains scattered addresses. After loading, the least significant byte of each element in Z0.S will be updated with data fetched from scattered memory locations.


Gather-load and Scatter-store Example
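The gather semantics can be sketched in scalar C. The 4-lane width and the function name gather_load_sb are illustrative assumptions; zeroing of inactive lanes mirrors the P0/Z predication in the example above.

```c
#include <stdint.h>
#include <stddef.h>

// Illustrative 4-lane model of a gather load of signed bytes: each
// active lane fetches one byte from its own (possibly non-contiguous)
// offset, and inactive lanes are zeroed (as with /Z predication).
void gather_load_sb(int32_t dst[4], const int8_t *mem,
                    const size_t idx[4], const int pred[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] = pred[i] ? (int32_t)mem[idx[i]] : 0;
}
```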

  • Predicate-driven loop control and management

As a key feature of SVE, the P register not only controls individual elements of vector operations flexibly but also enables predicate-driven loop control, making loop control efficient and flexible. Recording the active and inactive elements in a P register eliminates the overhead of extra loop heads and tails for processing partial vectors: in subsequent loop iterations, only the active elements perform the intended operations. For example:

WHILELO P0.S, x8, x9  // Generate a predicate in P0 that is true, starting from the lowest-numbered element, while the incrementing value of the first unsigned scalar operand X8 is less than the second scalar operand X9, and false from there up to the highest-numbered element.

B.FIRST Loop_start     // B.FIRST (equivalent to B.MI) or B.NFRST (equivalent to B.PL) is usually used to branch based on the test result of the above instruction, determining whether the first element of P0 is true or false as the condition to end or continue the loop.

Example of loop control and management driven by P register
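The WHILELO predicate can be modelled in plain C as follows; the vl parameter stands in for the implementation-defined element count, and the function name is an assumption for the example.

```c
#include <stdint.h>

// Illustrative model of WHILELO: lane i of the predicate is true while
// base + i < limit. The final partial vector of a loop therefore needs
// no separate scalar tail: the predicate simply masks the excess lanes.
void whilelo(int pred[], uint64_t base, uint64_t limit, int vl) {
    for (int i = 0; i < vl; i++)
        pred[i] = (base + (uint64_t)i < limit) ? 1 : 0;
}
```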

  • Vector partitioning for software-managed speculation

Speculative loads pose a challenge for traditional vector memory reads: if some elements fault during the read, it is difficult to reverse the load operation and track which elements failed. Neon does not allow speculative loads. To allow speculative vector loads, SVE introduces first-fault vector load instructions (such as LDFF1D), and to allow vector accesses that cross into invalid pages, SVE introduces the FFR register. When a first-fault load instruction loads an SVE vector, the FFR is updated with the success or failure of each element's load: when a load faults, the corresponding FFR element and all higher-numbered elements are set to false (0), and no exception is triggered. Typically, the RDFFR instruction is used to read the FFR state; iteration ends when the first element is false and continues while it is true. The FFR has the same length as a P register, and it can be initialized with the SETFFR instruction. The following example uses LDFF1D to read data from memory, updating FFR accordingly:

LDFF1D Z0.D, P0/Z, [Z1.D, #0] // Use first-fault behavior to gather doublewords from the memory addresses generated by vector base address Z1 plus 0 into the active elements of Z0. Inactive elements do not read device memory or trigger a fault, and are set to zero in the destination vector. A successful load from valid memory sets the corresponding FFR element to true; the first faulting load sets its FFR element and all remaining elements to false (0).

Example of Vector Partitioning for Software-Managed Speculation

  • Extended floating point and horizontal reduction

To allow efficient in-vector reduction operations that meet different precision requirements, SVE enhances floating-point and horizontal reduction operations. These instructions may use a sequential (low-to-high) or tree-based (pairwise) floating-point reduction order; because the order of operations can lead to different rounding results, these operations trade off reproducibility against performance. For example:

FADDA D0, P0/M, D1, Z2.D // Perform a floating-point addition strict-order reduction from the low to high elements of the source vector, accumulating the result into the SIMD&FP scalar register. This example instruction adds D1 to all active elements of Z2.D and stores the result into scalar register D0. Vector elements are processed in strict order from low to high, with scalar source D1 providing the initial value. Inactive elements in the source vector are ignored. FADDV performs a recursive pairwise reduction and stores the result into the scalar register.

Extended Floating-point and Horizontal Reductions Example
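The difference between the two reduction orders can be sketched in plain C for a fixed 4-element vector (the width and function names are assumptions for the example); with floating point, the two orders can round differently.

```c
// Strict low-to-high reduction, as FADDA performs, with a scalar
// initial value (predication omitted for simplicity).
float reduce_strict(float init, const float v[4]) {
    float acc = init;
    for (int i = 0; i < 4; i++)
        acc += v[i];                      // one element at a time, in order
    return acc;
}

// Tree-based pairwise reduction, as FADDV performs.
float reduce_pairwise(const float v[4]) {
    return (v[0] + v[1]) + (v[2] + v[3]); // recursive pairwise order
}
```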

# 3. New Features of SVE2

This section introduces the features added by SVE2 to the Arm AArch64 architecture. To achieve scalable performance, SVE2 is built on SVE, allowing vectors to reach up to 2048 bits.

In SVE2, many instructions that replicate existing instructions in Neon have been added, including:

  • Converted Neon integer operations, for example, Signed Absolute Difference Accumulate (SABA) and Signed Halving Add (SHADD).
  • Converted Neon extensions, narrowing and paired operations, for example, Unsigned Add Long - Bottom (UADDLB) and Unsigned Add Long - Top (UADDLT).

The order of element processing has changed. SVE2 processes interleaved even and odd elements, while Neon processes the low half and high half elements of narrow or wide operations. The diagram below illustrates the difference between Neon and SVE2 processing:


Comparison of Transformed Neon Narrow or Wide Operations

  • Complex number operations, such as complex integer multiplication-accumulation with rotation (CMLA).
  • Multi-precision arithmetic, used for large integer arithmetic and cryptography, for example, carry-in long addition - bottom (ADCLB), carry-in long addition - top (ADCLT) and SM4 encryption and decryption (SM4E).
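The "bottom/top" even/odd lane selection shown in the diagram above can be sketched in plain C; the 8-lane input width and the function names are assumptions for illustration, not ACLE intrinsics.

```c
#include <stdint.h>

// Illustrative model of SVE2 UADDLB/UADDLT on 8 lanes of 32-bit input:
// "bottom" widens and adds the even-numbered lanes, "top" the
// odd-numbered lanes, unlike Neon's low-half/high-half split.
void uaddl_bottom(uint64_t dst[4], const uint32_t a[8], const uint32_t b[8]) {
    for (int i = 0; i < 4; i++)
        dst[i] = (uint64_t)a[2 * i] + (uint64_t)b[2 * i];         // even lanes
}

void uaddl_top(uint64_t dst[4], const uint32_t a[8], const uint32_t b[8]) {
    for (int i = 0; i < 4; i++)
        dst[i] = (uint64_t)a[2 * i + 1] + (uint64_t)b[2 * i + 1]; // odd lanes
}
```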

For backward compatibility, the latest architecture requires Neon and VFP. Although SVE2 includes some features of SVE and Neon, SVE2 does not preclude the presence of Neon on the chip.

SVE2 supports optimization for emerging applications beyond the HPC market, such as in machine learning (ML) (UDOT instructions), computer vision (TBL and TBX instructions), baseband networks (CADD and CMLA instructions), genomics (BDEP and BEXT instructions), and servers (MATCH and NMATCH instructions).

SVE2 enhances the overall performance of general-purpose processors in handling large volumes of data, without the need for additional off-chip accelerators.

# 4. Using SVE programming

This section introduces software tools and libraries that support SVE2 application development. This section also explains how to develop applications for targets that support SVE2, run the application on hardware that supports SVE2, and simulate the application on any Armv8-A hardware.

# 4.1 Software and Library Support

To build SVE or SVE2 applications, you must choose a compiler that supports SVE and SVE2 features.

  • GNU tools version 8.0+ supports SVE.
  • Arm Compiler for Linux Version 18.0+ supports SVE, Version 20.0+ supports SVE and SVE2.
  • Both GNU and Arm Compiler for Linux compilers support optimizing C/C++/Fortran code.
  • LLVM (open-source Clang) version 5 and above includes support for SVE, and version 9 and above includes support for SVE2. To find out which SVE or SVE2 features each version of the LLVM tools supports, refer to the LLVM toolchain SVE support page.

Arm Performance Libraries are highly optimized for mathematical routines and can be linked to your applications. Arm Performance Libraries version 19.3+ supports SVE’s math library.

Arm Compiler for Linux is part of Arm Allinea Studio, including Arm C/C++ Compiler, Arm Fortran Compiler, and Arm Performance Libraries.

# 4.2 How to Program Using SVE2

There are several methods to write or generate SVE and SVE2 code. In this section, we will explore some of these methods.

To write or generate SVE and SVE2 code, you can:

  • Write SVE assembly code
  • Programming with SVE intrinsics
  • Automatic vectorization
  • Use SVE optimization library

Let’s take a closer look at these four options.

# 4.2.1 Write SVE assembly code

You can write SVE instructions as inline assembly in C/C++ code, or as a complete function in assembly source code. For example:

```assembly
    .globl subtract_arrays // -- Begin function
    .p2align 2
    .type subtract_arrays, @function
    subtract_arrays: // @subtract_arrays
    .cfi_startproc
// %bb.0:
    orr w9, wzr, #0x400
    mov x8, xzr
    whilelo p0.s, xzr, x9
.LBB0_1: // =>This Inner Loop Header: Depth=1
    ld1w { z0.s }, p0/z, [x1, x8, lsl #2]
    ld1w { z1.s }, p0/z, [x2, x8, lsl #2]
    sub z0.s, z0.s, z1.s
    st1w { z0.s }, p0, [x0, x8, lsl #2]
    incw x8
    whilelo p0.s, x8, x9
    b.mi .LBB0_1
// %bb.2:
    ret
.Lfunc_end0:
    .size subtract_arrays, .Lfunc_end0-subtract_arrays
    .cfi_endproc
```

If you write functions that mix high-level language and assembly language, you must be familiar with the Application Binary Interface (ABI) standards updated for SVE. The Arm Architecture Procedure Call Standard (AAPCS) specifies data types and register allocation, and is most relevant to assembly programming. AAPCS requires:

  • Z0-Z7 and P0-P3 are used to pass scalable vector parameters and results.
  • Z8-Z15 and P4-P15 are callee-saved.
  • All other vector registers (Z16-Z31) may be corrupted by the called function, and the calling function is responsible for backing up and restoring them when necessary.

# 4.2.2 Using SVE Instruction Functions (Intrinsics)

SVE intrinsics are compiler-supported functions that map to corresponding SVE instructions, so programmers can call them directly from high-level languages such as C and C++. The ACLE (Arm C Language Extensions) for SVE defines which SVE intrinsics are available, their parameters, and their functionality. A compiler that supports ACLE replaces each intrinsic with the mapped SVE instructions during compilation. To use the ACLE intrinsics, you must include the header file arm_sve.h, which declares the vector types and intrinsic functions available to C/C++. Each data type describes the size and element type of the vector:

  • svint8_t svuint8_t
  • svint16_t svuint16_t svfloat16_t
  • svint32_t svuint32_t svfloat32_t
  • svint64_t svuint64_t svfloat64_t

For example, svint64_t represents a 64-bit signed integer vector, svfloat16_t represents a half-precision floating-point vector.

The following example C code has been manually optimized using SVE intrinsics:

```c
// intrinsic_example.c
#include <arm_sve.h>

svuint64_t uaddlb_array(svuint32_t Zs1, svuint32_t Zs2)
{
    // widening add of even elements
    svuint64_t result = svaddlb(Zs1, Zs2);
    return result;
}
```

Source code that includes the arm_sve.h header can use the SVE vector types just like other data types, for variable declarations and function parameters. To compile the code with the Arm C/C++ compiler, targeting the Armv8-A architecture with SVE2 support, use:

armclang -O3 -S -march=armv8-a+sve2 -o intrinsic_example.s intrinsic_example.c

This command generates the following assembly code:

```assembly
// intrinsic_example.s
uaddlb_array:         // @uaddlb_array
    .cfi_startproc
// %bb.0:
    uaddlb z0.d, z0.s, z1.s
    ret
```

# 4.2.3 Automatic Vectorization

C/C++/Fortran compilers (for example, the native Arm Compiler for Linux for the Arm platform and the GNU compiler) support vectorization of C, C++, and Fortran loops using SVE or SVE2 instructions. To generate SVE or SVE2 code, choose the appropriate compiler options. For example, one option to enable SVE2 optimization using armclang is -march=armv8-a+sve2. If you want to use the SVE version of the library, combine -march=armv8-a+sve2 with -armpl=sve.

# 4.2.4 Using SVE/SVE2 to Optimize Libraries

Use libraries highly optimized for SVE/SVE2, such as Arm Performance Libraries and Arm Compute Libraries. Arm Performance Libraries contain highly optimized implementations of mathematical functions optimized for BLAS, LAPACK, FFT, sparse linear algebra, and libamath. To be able to link any Arm Performance Libraries function, you must install Arm Allinea Studio and include armpl.h in your code. To build applications using Arm Compiler for Linux and Arm Performance Libraries, you must specify -armpl=<arg> on the command line. If you are using GNU tools, you must include the Arm Performance Libraries installation path in the linker command line with -L<armpl_install_dir>/lib and specify the GNU option equivalent to the Arm Compiler for Linux -armpl=<arg> option, which is -larmpl_lp64. For more information, please refer to the Arm Performance Libraries Getting Started Guide.

# 4.3 How to run SVE/SVE2 programs

If you do not have access to SVE hardware, you can use models or simulators to run the code. You can choose from the following models and simulators:

  • QEMU: Cross-compilation and native models, supporting modeling on Arm AArch64 platforms with SVE.
  • Fast Models: Cross-platform models that support modeling on Arm AArch64 platforms with SVE running on x86-based hosts. Architecture Envelope Model (AEM) with SVE2 support is only available to major partners.
  • Arm Instruction Emulator (ArmIE): Runs directly on the Arm platform. Supports SVE and supports SVE2 from version 19.2+.

# 5. ACLE Intrinsics

# 5.1 ACLE Introduction

ACLE (Arm C Language Extensions) is used in C and C++ code to support Arm features through intrinsics and other characteristics.

  • ACLE (Arm C Language Extensions) extends the C/C++ language with Arm-specific features.
    • Predefined macros: __ARM_ARCH_ISA_A64, __ARM_BIG_ENDIAN, etc.
    • Intrinsic functions: __clz(uint32_t x), __cls(uint32_t x), etc.
    • Data types: SVE, NEON, and FP16 data types.
  • ACLE support for SVE enables vector-length-agnostic (VLA) programming.
    • Almost every SVE instruction has a corresponding intrinsic function.
    • Data types represent the size-agnostic vectors used by the SVE intrinsics.
  • ACLE is aimed at:
    • Users who want to hand-tune SVE code.
    • Users who want to adapt or manually optimize applications and libraries.
    • Users who need low-level access to Arm targets.

# 5.2 How to use ACLE

  • Include the header files
    • arm_acle.h: core ACLE.
    • arm_fp16.h: adds the FP16 data types.
      • The target platform must support FP16, i.e., -march=armv8-a+fp16.
    • arm_neon.h: adds the NEON intrinsics and data types.
      • The target platform must support NEON, i.e., -march=armv8-a+simd.
    • arm_sve.h: adds the SVE intrinsics and data types.
      • The target platform must support SVE, i.e., -march=armv8-a+sve.

# 5.3 SVE ACLE

  • First, include the header file: #include <arm_sve.h>
  • VLA data types
    • svfloat64_t, svfloat16_t, svuint32_t, etc.
    • Naming convention: sv<datatype><datasize>_t
  • Predication suffixes
    • Merging: _m
    • Zeroing: _z
    • Don't care: _x
    • The P register data type is svbool_t
  • Generic function overloading: for example, svadd automatically selects the correct variant based on the argument types.
  • Function naming convention: svbase[_disambiguator][_type0][_type1]...[_predication]
    • base is the basic operation, such as add, mul, sub, etc.
    • disambiguator distinguishes different variants of the same basic operation.
    • typeN specifies the types of the vector and P register operands.
    • predication specifies how inactive elements are handled.
    • For example: svfloat64_t svld1_f64, svbool_t svwhilelt_b8, svuint32_t svmla_u32_z, svuint32_t svmla_u32_m

# 5.4 Common SVE Intrinsics

  • Predicates
    • A predicate is a vector of booleans that controls whether the corresponding lane of a vector participates in a computation.
    • svbool_t pg = svwhilelt_b32(i, num) generates a predicate that is true for each of (i, i + 1, i + 2, ..., i + vl - 1) that is less than num.
    • svbool_t pg = svptrue_b32() generates an all-true predicate.
    • Here b32 corresponds to 32-bit data (int/float); b8, b16, and b64 intrinsics also exist.
  • Memory access
    • svld1(pg, *base): load a contiguous vector from address base.
    • svst1(pg, *base, vec): store the vector vec to address base.
    • svld1_gather_index(pg, *base, vec_index): load the data at the positions given by the vector of indices from address base.
    • svst1_scatter_index(pg, *base, vec_index, vec): store the data in vector vec to the positions given by the vector of indices.
  • Basic arithmetic
    • svadd_z(pg, sv_vec1, sv_vec2)
    • svadd_m(pg, sv_vec1, sv_vec2)
    • svadd_x(pg, sv_vec1, sv_vec2)
    • svadd_x(pg, sv_vec1, x)
    • _z sets the lanes where pg is false to zero, _m keeps their original values, and _x leaves them undefined (any value is possible).
    • The second operand can be a scalar.
    • svmul, svsub, svsubr, svdiv, svdivr: svsubr swaps the subtrahend and minuend relative to svsub.
  • Others
    • svdup_f64(double x): generate a vector with all elements equal to x.
    • svcntd(): returns the number of 64-bit elements in a vector; svcntb, svcnth, and svcntw correspond to 8-, 16-, and 32-bit elements.
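The strip-mined loop shape that svcntw and svwhilelt encourage can be sketched in scalar C. The fixed width VL = 4 stands in for the hardware vector length, and the inner lane loop plays the role of one predicated vector operation; this is a model of the semantics, not intrinsics code.

```c
// Scalar model of a predicated, strip-mined vector add: advance by the
// "vector length" VL and let the predicate (the i + lane < n test) mask
// the final partial chunk, so no scalar tail loop is needed.
#define VL 4

void vec_add(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += VL) {             // one iteration = one vector
        for (int lane = 0; lane < VL; lane++) {
            if (i + lane < n)                     // svwhilelt-style predicate
                dst[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```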

# 5.5 SVE Structure Intrinsics

For structure data, SVE provides special intrinsics such as svld3, svget3, svset3, and svst3, which process arrays of structures.

For example, for the particle structure:

```c
typedef struct {
    float x;
    float y;
    float z;
} Particle;
```

You can use svld3 to load all the data in the structure as a group of 3 vectors, and then use svget3 to extract a vector from the group of 3 vectors, where the value of index 0, 1, 2 corresponds to x, y, z respectively.

```c
Particle *ps;
float factor = 2.2;
// Initialization part omitted
for (int i = 0; i < num; i += svcntw()) {
    svbool_t pg = svwhilelt_b32(i, num);
    svfloat32x3_t sv_ps = svld3(pg, (float32_t *)&ps[i]);
    svfloat32_t sv_ps_x = svget3(sv_ps, 0);
    svfloat32_t sv_ps_y = svget3(sv_ps, 1);

    // Perform calculation
    sv_ps_x = svmul_x(pg, sv_ps_x, factor);
    sv_ps_y = svmul_x(pg, sv_ps_y, factor);

    // Save results
    sv_ps = svset3(sv_ps, 0, sv_ps_x);
    sv_ps = svset3(sv_ps, 1, sv_ps_y);
    svst3(pg, (float32_t *)&ps[i], sv_ps);
}
```
  • svld3(pg, *base): Load all data in the structure as a group of 3 vectors; where base is the address of the 3-element structure array.
  • svget3(tuple, index): Extract a vector from a group of 3 vectors; the value of index is 0, 1, or 2.
  • svset3(tuple, index, vec): Set one vector in a group of 3 vectors; the value of index is 0, 1, or 2.
  • svst3(pg, *base, vec): Store a group of 3 vectors into a structure; where base is the address of an array of structures with 3 elements.

# 5.6 SVE Condition Selection

SVE provides methods such as svcmplt, svcompact, svcntp_b32, etc., which can select elements to retain in the vector based on conditions.

For example, for non-vectorized code:

```c
for (int i = 0; i < num; i++) {
    float tmp = provided[i];
    if (tmp < mark) {
        selected[count++] = tmp;
        if (count >= maxSize) {
            break;
        }
    }
}
```

The purpose of this code is to select elements from the provided array that are less than mark and store them in the selected array until the selected array is full.

Rewrite with SVE Intrinsic:

```c
for (int i = 0; i < num; i += svcntw()) {
    svbool_t pg = svwhilelt_b32(i, num);
    svfloat32_t sv_tmp = svld1(pg, &provided[i]);
    svbool_t pg_sel = svcmplt(pg, sv_tmp, mark);
    sv_tmp = svcompact(pg_sel, sv_tmp);
    svst1(pg, &selected[count], sv_tmp);
    count += svcntp_b32(pg, pg_sel);
    if (count >= maxSize) {
        break;
    }
}
```
  • svcmplt(pg, vec1, vec2): compare two vectors element-wise, returning a predicate marking the positions where vec1 is less than vec2.
  • svcompact(pg, sv_tmp): compact the vector, moving the elements active in pg to the low positions of the vector in order and zeroing the remaining positions.
  • svcntp_b32(pg, pg2): returns the number of active elements in pg2.

This code first loads the data from the provided array into sv_tmp, then uses svcmplt to generate a predicate marking the positions less than mark. Next, it uses svcompact to compress sv_tmp, gathering the data less than mark, and stores it into the selected array with svst1. Finally, it uses svcntp_b32 to count the active elements and update count.

svcompact schematic diagram (256-bit vector)

Because of the compact operation, the new values less than mark are stored into the selected array contiguously starting at position count, with the remaining stored lanes set to zero.
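The compact-then-count behavior can be modelled in plain C for one 4-lane vector (the width and function name are assumptions for the example); the return value plays the role of svcntp.

```c
// Illustrative model of svcompact on 4 lanes: active elements move to the
// low lanes in order, remaining lanes are zeroed, and the active count is
// returned (the job svcntp_b32 does in the intrinsics version above).
int compact4(float out[4], const float in[4], const int pred[4]) {
    int k = 0;
    for (int i = 0; i < 4; i++)
        if (pred[i])
            out[k++] = in[i];
    for (int i = k; i < 4; i++)
        out[i] = 0.0f;
    return k;
}
```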


svst1 schematic diagram (256-bit vector)

# 5.7 SVE Vectorized Loop Interleaving

The vectorized loop interleaving implemented by SVE Intrinsic can greatly reduce the number of times vectors are read compared to compiler auto vectorization.

For example, for non-vectorized code:

```c
for (int j = offset; j < outerLen - offset; j++) {
    int m2index = (j - offset) * innerLen;
    int m1index = m2index + innerLen;
    int m0index = m1index + innerLen;
    int p1index = m0index + innerLen;
    int p2index = p1index + innerLen;
    for (int i = 0; i < innerLen; i++) {
        res[m0index + i] = m2factor * field[m2index + i] +
                           m1factor * field[m1index + i] +
                           m0factor * field[m0index + i] +
                           p1factor * field[p1index + i] +
                           p2factor * field[p2index + i];
    }
}
```

After the compiler automatically vectorizes the code, each iteration requires reading data from five different vectors, resulting in low efficiency.

Rewrite with SVE Intrinsic:

```c
for (int i = 0; i < innerLen; i += svcntd()) {
    svbool_t pg = svwhilelt_b64(i, innerLen);  // 64-bit elements, so b64
    int dataIndex = i;
    svfloat64_t jm2Field = svld1(pg, &field[dataIndex]);
    dataIndex += innerLen;
    svfloat64_t jm1Field = svld1(pg, &field[dataIndex]);
    dataIndex += innerLen;
    svfloat64_t jm0Field = svld1(pg, &field[dataIndex]);
    dataIndex += innerLen;
    svfloat64_t jp1Field = svld1(pg, &field[dataIndex]);

    for (int j = offset; j < outerLen - offset; j += 1) {
        // p2index of the scalar version: (j - offset + 4) * innerLen
        svfloat64_t jp2Field = svld1(pg, &field[(j - offset + 4) * innerLen + i]);
        svfloat64_t svRes = svmul_x(pg, jm2Field, m2factor);
        svRes = svmad_x(pg, jm1Field, m1factor, svRes);
        svRes = svmad_x(pg, jm0Field, m0factor, svRes);
        svRes = svmad_x(pg, jp1Field, p1factor, svRes);
        svRes = svmad_x(pg, jp2Field, p2factor, svRes);
        // m0index of the scalar version: (j - offset + 2) * innerLen
        svst1(pg, &res[(j - offset + 2) * innerLen + i], svRes);
        jm2Field = jm1Field;
        jm1Field = jm0Field;
        jm0Field = jp1Field;
        jp1Field = jp2Field;
    }
}
```
  • svmad_x(pg, vec1, vec2, vec3): Calculates vec1 * vec2 + vec3, returns a vector.
  • This code only needs to read one vector per iteration, greatly reducing the number of vector reads.

# References

  1. Introduction to SVE2
  2. SVE Deep Dive
  3. Arm C Language Extensions