RDMA: Memory Window

Non-commercial reprints of this article are welcome; please credit the source when reprinting.

Statement: For collection only, for easier reading.
― Savir, Zhihu Column: 14. RDMA Memory Window

This article is the 14th in the “RDMA Talk” column. Reposting is welcome; please credit the source.

In the article 【RDMA Memory Region】, we introduced the Memory Region, a special memory area registered by the user. On one hand, its contents will not be swapped out to disk; on the other hand, the RDMA network card records its address translation, so that the hardware can find the corresponding physical address once it obtains the virtual address specified by the user in the WR.

In this article, we explain the concept of the Memory Window, a more flexible memory management unit built on top of the Memory Region. Besides MW itself, this article also gives a more detailed introduction to some memory-related concepts in RDMA, such as L_Key and R_Key. It is best read together with 【RDMA Memory Region】, and readers are encouraged to review that article first.

# What is Memory Window

Memory Window, abbreviated as MW (内存窗口 in Chinese), is an RDMA resource that the user allocates in order to let a remote node access a local memory area. Each MW is bound (the operation is called bind) to an already registered MR, but it provides more flexible permission control than an MR does. An MW can be roughly understood as a subset of an MR: many MWs can be carved out of one MR, and each MW can set its own permissions. The relationship between MW and MR is shown in the following diagram:

[Figure: The relationship between MR and MW]

# Memory access permission control

To explain why MW is designed, let’s first discuss the access control involved in both MR and MW.

# MR/MW permissions configuration

The permissions here describe whether the local or the remote node may read or write the local memory, which gives four combinations:

| | Local End | Remote End |
|---|---|---|
| Read | Local Read | Remote Read |
| Write | Local Write | Remote Write |

Apart from these four types of permissions, there are also Atomic permissions, etc., which are not within the scope of this article.

Among the four types of permissions in the table, the most basic is Local Read. Users must always grant it to an MR/MW, because memory that local users cannot access has no purpose. There is one further restriction: if an MR is to be configured with Remote Write or the not-yet-introduced Remote Atomic permission, it must also be configured with Local Write permission. Within this constraint, each MR or MW can configure permissions as needed. For example, if an MR we registered should allow remote nodes to write data but not read it, we enable Remote Write and disable Remote Read. Then, when the HCA (network card) receives a WRITE request from the peer for an address within the range of this MR, it allows it; when the HCA receives a READ request from the peer on this MR, it rejects the request and returns an error to the peer.
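The constraint above can be written down as a small check. The following is only a toy sketch in Python, not a real Verbs API: the `Access` flags and the two function names are hypothetical names chosen for illustration.

```python
from enum import Flag, auto


class Access(Flag):
    """Hypothetical permission flags mirroring the table above."""
    LOCAL_READ = auto()
    LOCAL_WRITE = auto()
    REMOTE_READ = auto()
    REMOTE_WRITE = auto()
    REMOTE_ATOMIC = auto()


def validate_mr_access(requested: Access) -> Access:
    """Return the effective permissions, or raise if the rule is violated."""
    # Local Read is always granted, whatever the user asked for.
    effective = requested | Access.LOCAL_READ
    # Remote Write and Remote Atomic both require Local Write.
    needs_local_write = Access.REMOTE_WRITE | Access.REMOTE_ATOMIC
    if (effective & needs_local_write) and not (effective & Access.LOCAL_WRITE):
        raise ValueError("Remote Write/Atomic requires Local Write")
    return effective


def remote_op_allowed(mr_access: Access, op: Access) -> bool:
    """Model of the HCA decision for an incoming remote operation."""
    return bool(mr_access & op)


# The example from the text: remote nodes may write but not read.
acc = validate_mr_access(Access.LOCAL_WRITE | Access.REMOTE_WRITE)
print(remote_op_allowed(acc, Access.REMOTE_WRITE))  # WRITE request is allowed
print(remote_op_allowed(acc, Access.REMOTE_READ))   # READ request is rejected
```

In this sketch the check happens at registration time, which matches the restriction described above; how a real driver reports the error is outside the scope of this model.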

# Memory Key

The access permissions above cannot, on their own, keep malicious users out of local or remote memory. For example, if a node grants Remote Write permission on a memory region, wouldn’t any remote node (process) be able to write to this region as long as it provides the correct address information? For this reason, the IB specification defines the Memory Key, which can be understood simply as a lock-and-key mechanism for accessing an MR: only with the correct key can one open the door to an MR/MW.

A Memory Key is a 32-bit number consisting of two parts, a 24-bit Index and an 8-bit Key:

[Figure: Composition of L_Key/R_Key]

The Index is used by the HCA to quickly locate the local virtual-to-physical address translation table and other MR-related information, while the Key is used to check the validity of the whole field, so that unauthorized users cannot simply pass in an arbitrary Index.

Memory Keys are divided into two types according to their usage, Local Key and Remote Key:
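As a small illustration of the two-part layout, the sketch below packs and unpacks a 32-bit Memory Key. It assumes that the Index occupies the upper 24 bits and the Key the lower 8 bits; the figure that fixes the bit positions is not reproduced here, so treat that placement as an assumption. The function names are hypothetical.

```python
from typing import Tuple

INDEX_BITS = 24  # Index part, assumed to sit in the upper bits
KEY_BITS = 8     # Key part, assumed to sit in the lower bits


def make_key(index: int, key: int) -> int:
    """Combine a 24-bit index and an 8-bit key into one 32-bit Memory Key."""
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index does not fit in 24 bits")
    if not 0 <= key < (1 << KEY_BITS):
        raise ValueError("key does not fit in 8 bits")
    return (index << KEY_BITS) | key


def split_key(memory_key: int) -> Tuple[int, int]:
    """Split a 32-bit Memory Key back into (index, key)."""
    return memory_key >> KEY_BITS, memory_key & ((1 << KEY_BITS) - 1)


r_key = make_key(0x1234, 0xAB)
print(hex(r_key))
print(split_key(r_key))
```

Two Memory Keys with the same Index but different Key parts refer to the same table entry, yet only the one whose Key part matches the current value would be accepted.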

# L_Key

Local Key, associated with an MR, is used for HCA to access local memory. When a process on the local side attempts to use memory of an already registered MR, the HCA will verify the L_Key it passes. It uses the index in the L_Key to look up the address translation table, translates the virtual address into a physical address, and then accesses the memory.

In the article 【RDMA Shared Receive Queue】 , we described sge, which consists of a starting address, length, and key. When users fill out a WR, if they need the HCA to access the local memory, they need to describe the memory block through a linked list of sge (sgl). Here, the key in the sge is filled with L_Key, which are key1 and key3 in the diagram below, representing the L_Key of MR1 and MR2, respectively. Without L_Key, any local user process could direct the hardware to access the contents of other locally registered MRs, and the hardware would find it difficult to efficiently translate virtual addresses to physical addresses.

# R_Key

Remote Key, associated with an MR or MW, is used by a remote node to access local memory. When a remote node attempts to access local memory, the local HCA first verifies that the R_Key is valid, then uses the index in the R_Key to look up the address translation table, translates the virtual address into a physical address, and accesses the memory.

For any RDMA operation (i.e., Write/Read/Atomic), the user must carry the remote memory region’s R_Key in the WR.
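Putting the pieces together, the checks that the local HCA performs on an incoming RDMA request can be modeled roughly as follows. This is a toy model under stated assumptions, not how any particular HCA is implemented: `ToyHca`, `Region`, and the result strings are hypothetical, the Index is assumed to be the upper 24 bits of the R_Key, and only READ and WRITE are modeled.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Region:
    """One entry of the hypothetical HCA table, found through the Index."""
    key: int            # 8-bit Key part that is currently valid
    start: int          # first virtual address covered
    length: int         # number of bytes covered
    remote_read: bool
    remote_write: bool


class ToyHca:
    def __init__(self) -> None:
        self.table: Dict[int, Region] = {}  # Index -> Region

    def check_remote(self, r_key: int, addr: int, size: int, op: str) -> str:
        """Decide what happens to a remote 'READ' or 'WRITE' request."""
        index, key = r_key >> 8, r_key & 0xFF
        region = self.table.get(index)
        # 1. The Index must exist and the Key part must match.
        if region is None or region.key != key:
            return "invalid R_Key"
        # 2. The accessed range must lie inside the registered range.
        if addr < region.start or addr + size > region.start + region.length:
            return "out of bounds"
        # 3. The operation must be permitted.
        allowed = region.remote_read if op == "READ" else region.remote_write
        return "ok" if allowed else "permission denied"


hca = ToyHca()
hca.table[0x10] = Region(key=0x5A, start=0x1000, length=0x100,
                         remote_read=False, remote_write=True)
r_key = (0x10 << 8) | 0x5A
print(hca.check_remote(r_key, 0x1000, 64, "WRITE"))
print(hca.check_remote(r_key, 0x1000, 64, "READ"))
```

Address translation itself is left out; the model only shows why an attacker who guesses an Index still needs the matching Key part, the right address range, and a permitted operation.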

The IB specification ensures that MR can be accessed correctly and safely according to the user’s expectations through the two mechanisms mentioned above. We use a metaphor to summarize the content related to MR/MW permission control:

A equipped their room (MR) with two keys (Memory Key), one for personal use (L_Key), and the other key (R_Key) was sent to B (can be via any communication method). B can open the door when A is not home (the local CPU does not perceive the remote node’s RDMA operations on local memory) using the key (R_Key). After opening the door, B might only be able to view the room’s arrangement through glass (A only granted remote read permission for this MR), or enter the room and find it completely dark, unable to see anything, but can place items in the room (A only granted remote write permission for this MR), and of course, it is also possible that there’s no glass and the lights are on (remote read and write permissions were granted simultaneously).

# Why have MW

In short, the purpose of designing MW is to control remote memory access permissions more flexibly.

In the article 【RDMA Memory Region】, we introduced how a user registers an MR: the process transitions from user mode to kernel mode, calls kernel functions to pin the memory (to prevent it from being paged out), then creates a virtual-to-physical address mapping table and hands it to the hardware.

Because the MR is managed by the kernel, a user who wants to modify an existing MR (for example, to revoke its remote write permission and leave only remote read, or to invalidate an R_Key previously given to a remote node) has to use the Reregister MR interface. This interface is equivalent to a Deregister MR followed by a Register MR. The whole process requires a transition into kernel mode and is time-consuming.

Unlike MR, which requires permission modification through the control path, MW can be dynamically bound to an already registered MR through the data path (i.e., directly issuing WR to the hardware from user space) after creation, and simultaneously set or change its access permissions. This process is much faster than re-registering MR.

To make a piece of memory available for RDMA WRITE/READ operations by a remote node, we have two methods: register an MR, or allocate an MW and bind it to an already registered MR. Both generate an R_Key to hand to the remote node. The first method has simpler preparation steps but is less flexible, and modifications after registration are relatively troublesome. The second method adds the steps of allocating an MW and binding it to an MR, but it allows convenient and quick control over remote access permissions.

# The relationship between MW and MR permissions

Some readers may wonder: permissions are configured when an MR is registered, and permissions are configured again when an MW is bound to that MR, so how do the two sets of permissions relate? The IB specification addresses this in section 10.6.7.2.2:

When binding a Memory Window, a Consumer can request any combination of remote access rights for the Window. However, if the associated Region does not have local write access enabled and the Consumer requests remote write or remote atomic access for the Window, the Channel Interface must return an error either at bind time or access time.

In summary, if you want to configure remote write or remote atomic operation (Atomic) permissions for MW, then the MR it is bound to must have local write permissions. In other cases, the permissions of the two do not interfere with each other: remote users using MW must follow the permission configuration of MW; remote users using MR must follow the permission configuration of MR.
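The rule can be expressed as a one-line predicate. The sketch below is a hypothetical helper, not part of any real API. Note that the specification allows the error to be reported either at bind time or at access time; this model simply rejects at bind time.

```python
def mw_bind_allowed(mr_local_write: bool,
                    mw_remote_write: bool,
                    mw_remote_atomic: bool) -> bool:
    """Return True if an MW with these remote rights may be bound to the MR.

    Only remote write and remote atomic depend on the MR; every other
    combination of MW rights is independent of the MR's own rights.
    """
    if (mw_remote_write or mw_remote_atomic) and not mr_local_write:
        return False
    return True


# An MR without local write cannot back a remotely writable window...
print(mw_bind_allowed(mr_local_write=False, mw_remote_write=True,
                      mw_remote_atomic=False))
# ...but it can back a window that only allows remote read.
print(mw_bind_allowed(mr_local_write=False, mw_remote_write=False,
                      mw_remote_atomic=False))
```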

# User Interface

As usual, when it comes to user interfaces, we classify them according to control paths and data paths:

# Control path

An MW can be created, deleted, and queried, but it cannot be directly modified:

# Create - Allocate MW

Allocate an MW. This mainly creates the software structures related to the MW and prepares the hardware. The user needs to specify the MW type, which is introduced later in this article. The interface returns a handle for the Memory Window, which the user uses to refer to this MW afterwards.

Note that at this time MW is not bound to MR and is in a state that cannot be accessed remotely.

# Delete - Deallocate MW

Deallocate an MW. This is easy to understand: it simply destroys the related resources.

# Query - Query MW

Query MW information, including R_Key and its status, MW type, and PD, etc.

It should be emphasized again that although this Verb is described in the IB specification, the corresponding API has not been implemented in the RDMA software stack. Quite a few Verbs are in a similar situation. The RDMA software stack is driven by practical use, and interfaces without user demand are generally not implemented.

# Data path

MW has a unique set of interfaces in the data path, divided into Bind and Invalidate categories:

# Bind

Bind means associating an MW with a specified range of an already registered MR and configuring its read and write permissions. Binding generates an R_Key, which the user can pass to a remote node for remote access. Note that an MW can be bound multiple times, and multiple MWs can be bound to a single MR. If an MR still has MWs bound to it, the MR cannot be deregistered.

[Figure: Software and hardware interaction of Bind]

There are two ways to bind: one is to call the Post Send interface to issue a Bind MW WR, and the other is to call the Bind MW interface.

Post Send Bind MW WR

As discussed earlier, the biggest advantage of MW over MR is the ability to configure permissions quickly from the data path. With the Post Send Bind MW WR operation, the user issues a WR to the SQ through the post send interface (such as ibv_post_send()), and the operation type of this WR, the field that would otherwise hold SEND, RDMA WRITE, or RDMA READ, is set to BIND MW. The WR also carries the permissions and the range of the MR to be bound. Unlike other WRs, a Bind MW WR does not cause the hardware to send any packets; the hardware instead binds the MW to the specified MR.

This method is only applicable to Type 2 MW introduced later.

Bind MW

Although this is an independent interface, it is in fact a wrapper around Post Send Bind MW WR. The user provides the information needed for the binding, including the permissions and the MR to be bound, and the driver assembles the WR and issues it to the hardware. When the interface succeeds, the newly generated R_Key is returned to the user.

This method is only applicable to Type 1 MW introduced later.

The relationship between the above two operations is as follows:

[Figure: The relationship between the two types of Bind operations]

# Invalidation

Invalidate means invalidation, referring to the operation where a user sends a WR with an Invalidate opcode to the hardware to invalidate an R_Key.

It is important to emphasize that the object of the Invalidate operation is the R_Key, not the MW itself. The effect after Invalidate is that the remote user can no longer use this R_Key to access the corresponding MW, but the MW resource still exists, and new R_Keys can still be generated for remote use in the future.

The Invalidate operation can only be used for Type 2 MW introduced below.

According to the different initiators of the Invalidate operation, it can be further divided into two types:

Local Invalidate

Local invalidate operation. If an upper-layer user wants to revoke the R_Key held by a remote user without reclaiming the MW resources, they can issue a Local Invalidate operation to the SQ. After the hardware receives it, it modifies the configuration of the corresponding MW. After successful execution, if the remote user holding this R_Key attempts to perform RDMA operations on the MW, the local hardware rejects them and returns an error.

Because this is a local operation, the hardware does not send any message onto the link after receiving this WR.

[Figure: Software and hardware interaction of the Local Invalidate operation]

Remote Invalidate

Remote invalidate operation. When a remote user no longer needs an R_Key, they can proactively send a message that lets the local end reclaim it. The remote user issues a WR with this opcode to their SQ; their hardware assembles a message and sends it to the local end. When the local hardware receives the Remote Invalidate operation, it sets the corresponding R_Key to an unusable state. Just as with Local Invalidate, the remote end can no longer use this R_Key to perform RDMA operations on the corresponding MW.

[Figure: Software and hardware interaction of the Remote Invalidate operation]

# Type of MW

Based on different implementations and application scenarios, the IB specification classifies MWs into the following types:

# Type 1

A Type 1 MW is associated with a QP only through the PD. It is not bound to any particular QP, so it does not prevent the destruction of a QP under the same PD.

The key field of the R_Key for Type 1 MW is controlled by the driver and hardware. Here, “controlled” means that the key is allocated by the driver and hardware, not by the upper-level user. This is also the reason mentioned earlier that Type 1 MW cannot perform the Invalidate operation. If a user of Type 1 MW wants to invalidate an R_Key, they can bind this MW again through the Bind MW interface. The hardware or driver will automatically allocate a new key field for the R_Key, and the original R_Key will become invalid.

In addition, if a user temporarily wants to unbind an MW from any MR but still wants to retain the related resources instead of destroying this MW, they can achieve this by calling the Bind MW interface and setting the MW length to 0.

The IB specification allows multiple Type 1 MWs to be bound to the same MR, and their ranges can overlap.
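The Type 1 behavior described above (each Bind MW call yields a fresh R_Key, the old one stops working, and a zero-length bind leaves the MW unbound) can be mimicked with a toy class. Everything here is hypothetical: `Type1Window` is not a real API, and the way the pretend driver picks the next key (incrementing the 8-bit value) is an assumption made only so that the example has something to do.

```python
class Type1Window:
    """Toy model of a Type 1 MW whose key field is owned by the driver."""

    def __init__(self, index: int) -> None:
        self.index = index   # 24-bit Index fixed at allocation
        self._key = 0        # 8-bit Key part, chosen by the "driver"
        self.start = 0
        self.length = 0      # 0 means "not bound to any MR"

    def bind(self, mr_start: int, mr_length: int,
             start: int, length: int) -> int:
        """Bind to [start, start + length) inside the MR; return the new R_Key."""
        if length and not (mr_start <= start
                           and start + length <= mr_start + mr_length):
            raise ValueError("window must lie inside the MR")
        # Assumption: the driver simply advances the 8-bit key on every bind.
        self._key = (self._key + 1) & 0xFF
        self.start, self.length = start, length
        return (self.index << 8) | self._key

    def accepts(self, r_key: int) -> bool:
        """Would a remote access carrying this R_Key be accepted?"""
        return self.length > 0 and r_key == ((self.index << 8) | self._key)


mw = Type1Window(index=7)
old = mw.bind(0x1000, 0x1000, 0x1000, 0x100)
new = mw.bind(0x1000, 0x1000, 0x1800, 0x100)  # rebinding replaces the R_Key
print(mw.accepts(old), mw.accepts(new))
mw.bind(0x1000, 0x1000, 0, 0)                 # zero length: unbind, keep the MW
print(mw.accepts(new))
```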

# Type 2

Type 2 MW gives users greater freedom: the key field of the R_Key is controlled by the user, who can assign it as they wish. As mentioned earlier, users perform the binding through the Post Send Bind MW WR operation, and this process does not return an R_Key. Users must remember the index obtained from the Allocate MW operation and combine it with their chosen 8-bit key to form the R_Key, then send it to the peer.

The user can invalidate an R_Key through the Invalidate operation introduced earlier. If you want to assign a new R_Key to the MW, you must first invalidate the previous R_Key through the Invalidate operation.

Unlike Type 1, a Type 2 MW does not support zero-length binding.

The IB specification also allows multiple Type 2 MWs to be bound to the same MR, and their ranges can overlap.
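The Type 2 life cycle described above (the user picks the key, binding returns nothing, and the old R_Key must be invalidated before a new bind) can be sketched as a small state machine. This is a toy model with hypothetical names, and it again assumes that the Index sits in the upper 24 bits of the R_Key.

```python
from typing import Optional


def user_r_key(index: int, key: int) -> int:
    """The user assembles the R_Key: index from Allocate MW plus an 8-bit key."""
    return (index << 8) | (key & 0xFF)


class Type2Window:
    """Toy model of a Type 2 MW whose key field is owned by the user."""

    def __init__(self, index: int) -> None:
        self.index = index
        self.valid_key: Optional[int] = None  # None: no R_Key is valid now

    def bind(self, user_key: int, length: int) -> None:
        """Model of a Bind MW WR; note that nothing is returned."""
        if length == 0:
            raise ValueError("Type 2 MW does not support zero-length binding")
        if self.valid_key is not None:
            raise RuntimeError("invalidate the previous R_Key first")
        self.valid_key = user_key & 0xFF

    def invalidate(self, r_key: int) -> None:
        """Model of Local/Remote Invalidate: the R_Key dies, the MW survives."""
        if self.accepts(r_key):
            self.valid_key = None

    def accepts(self, r_key: int) -> bool:
        return (self.valid_key is not None
                and r_key == user_r_key(self.index, self.valid_key))


mw = Type2Window(index=0x20)
mw.bind(user_key=0x01, length=4096)
k1 = user_r_key(0x20, 0x01)     # computed by the user, then sent to the peer
mw.invalidate(k1)               # k1 is now unusable, the MW still exists
mw.bind(user_key=0x02, length=4096)
print(mw.accepts(k1), mw.accepts(user_r_key(0x20, 0x02)))
```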

In addition, based on different binding relationships, Type 2 can be further divided into two implementation methods, with their differences lying solely in the binding relationship with QP.

# Type 2A

Associated with a QP through QPN, meaning that when remote access occurs within this MW range, in addition to the R_Key, the correct QPN must also be specified. If a QP has a bound Type 2A MW, then this QP cannot be destroyed.

# Type 2B

Associated with a QP through both QPN and PD, which adds a PD check compared to Type 2A. When the remote end accesses the memory of the MW through RDMA operations, the QPN must be correct, and in addition the PD of the local QP must be the same as the PD bound to this MW. Also, unlike Type 2A, a QP can be destroyed even if it still has a Type 2B MW bound to it.

The descriptions in the IB specification are relatively scattered, so let’s briefly summarize the similarities and differences of the MW types:
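The difference between the two sub-types comes down to which fields are compared on remote access and whether the QP may be destroyed. The following sketch models only those two points as described in this section; `Qp`, `Type2Mw`, and both functions are hypothetical names, and PDs are represented by plain integers.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Qp:
    qpn: int
    pd: int


@dataclass
class Type2Mw:
    kind: str    # "2A" or "2B"
    r_key: int
    qpn: int     # QPN the window was bound through
    pd: int      # PD the window belongs to


def remote_access_ok(mw: Type2Mw, qp: Qp, r_key: int) -> bool:
    """Checks applied when a request arrives on `qp` carrying `r_key`."""
    if r_key != mw.r_key or qp.qpn != mw.qpn:
        return False               # both sub-types check R_Key and QPN
    if mw.kind == "2B" and qp.pd != mw.pd:
        return False               # Type 2B additionally checks the PD
    return True


def qp_destroy_allowed(qp: Qp, bound_mws: Iterable[Type2Mw]) -> bool:
    """A QP with a bound Type 2A MW cannot be destroyed; Type 2B does not block."""
    return not any(mw.kind == "2A" and mw.qpn == qp.qpn for mw in bound_mws)


qp = Qp(qpn=5, pd=1)
mw_a = Type2Mw(kind="2A", r_key=0x2001, qpn=5, pd=2)
mw_b = Type2Mw(kind="2B", r_key=0x2101, qpn=5, pd=2)
print(remote_access_ok(mw_a, qp, 0x2001))  # 2A ignores the PD mismatch
print(remote_access_ok(mw_b, qp, 0x2101))  # 2B rejects it
print(qp_destroy_allowed(qp, [mw_b]), qp_destroy_allowed(qp, [mw_a]))
```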

| | Type 1 | Type 2A | Type 2B |
|---|---|---|---|
| Association | PD | QP | PD + QP |
| Ownership of the R_Key’s key field | Driver + Hardware | User | User |
| Binding method | Bind MW. After binding, the previous R_Key automatically becomes invalid. | Post Send Bind MW WR. Before binding, the previous R_Key needs to be invalidated. | Post Send Bind MW WR. Before binding, the previous R_Key needs to be invalidated. |
| Is zero length supported | Yes | No | No |
| Supports Invalidate | No | Yes | Yes |
| Can the associated QP be destroyed | - | No | Yes |

In addition, the IB specification gives the following rules for the types above: an HCA must implement Type 1 MW, and can optionally implement either Type 2A or Type 2B. Type 1 and Type 2 MWs can be associated with the same MR at the same time. Since I have not encountered many applications that use MW, I cannot clearly explain which scenarios each type of MW is suited to. If readers have insights on this topic, they are welcome to share and discuss.

That is all for MW, and it also concludes the introduction of the common resources in RDMA technology.

Given that RDMA-capable devices are generally quite expensive, in the next article I will introduce how to run some programming experiments on a software-simulated device, namely Soft-RoCE.

# Related sections of the IB specification

3.5.3 Memory Keys Introduction
9.4.1.1 Invalidate Operation
10.6.7 Permission Management
11.2.10.9~12 Related Verbs Introduction

# Reference document

[1] IB Specification Vol 1-Release-1.4

[2] Linux Kernel Networking - Implementation and Theory. Chapter 13
