RDMA: Memory Region

This article is reprinted from Zhihu Column: 6. RDMA Memory Region, Author: Savir. The essence of network communication is the migration of data in storage media, and RDMA manages memory through MR. MR is a special memory region, and this article introduces its concept and function.

# Memory Region of RDMA

This article welcomes non-commercial reproduction, please indicate the source.

Statement: For collection only, for easy reading

Savir, Zhihu Column: 6. RDMA Memory Region

Let’s consider a scenario and use it to review how an RDMA WRITE operation proceeds:

As shown in the figure below, Node A wants to write a piece of data into Node B’s memory via the IB protocol. The upper-layer application issues a WQE to the RDMA network card of the local node. The WQE contains information such as the source memory address, destination memory address, data length, and keys. The hardware then fetches the data from memory, packs it into packets, and sends it to the remote network card. When Node B’s network card receives the data, it parses out the destination memory address and writes the data into Node B’s memory.

[Figure: 6_1-2024-04-04]

This raises two questions. The addresses provided by the APP are all virtual addresses (Virtual Address, referred to as VA below), which must be translated by the MMU to obtain the real physical address (Physical Address, referred to as PA below). How, then, does the RDMA network card obtain the PA it needs to fetch data from memory? And even if the network card knows where to fetch the data, what stops a malicious user from specifying an illegal VA and “instructing” the network card to read or write critical memory?

To solve the above problem, the IB protocol proposed the concept of MR.

# What is MR

MR stands for Memory Region, which refers to a region designated by the RDMA software layer in memory for storing transmitted and received data. In the IB protocol, after a user requests a memory region for storing data, they must register the MR by calling the API provided by the IB framework to allow the RDMA network card to access this memory region. As can be seen from the diagram below, MR is just a special piece of memory:

[Figure: 6_2-2024-04-04]

When describing the IB protocol, we usually refer to the RDMA hardware as HCA (Host Channel Adapter). The IB protocol defines it as “an IB device in processors and I/O units capable of generating and consuming packets.” To remain consistent with the protocol, we will refer to the hardware part as HCA in this and subsequent articles.

# Why register MR

Let’s take a look at how MR addresses the two questions raised at the beginning of this article:

# 1. Register MR to achieve virtual-to-physical address translation

An APP can only see virtual addresses, and it passes the VA directly to the HCA in the WQE (both the source VA on the local end and the destination VA on the remote end). The CPU has the MMU and page tables to translate VA to PA, but the HCA does not: it is either connected directly to the bus or connected to the bus through the address translation of an IOMMU/SMMU. Either way, it cannot “understand” the real physical memory address that corresponds to the VA provided by the APP.

So during MR registration, a VA-to-PA mapping table is created and filled in for the HCA (in practice, the driver usually does this work on the hardware’s behalf), so that when needed, a VA can be converted to a PA by looking up the table. Let’s walk through a specific example to explain this process:

[Figure: 6_3-2024-04-04]

Now assume that the node on the left initiates an RDMA WRITE operation to the node on the right, directly writing data into the memory area of the right node. Assume that both ends in the diagram have already completed the registration of MR, which corresponds to the “data Buffer” in the diagram, and have also created the VA->PA mapping table.

  • First, the local APP issues a WQE to the local HCA, telling it the virtual address of the local buffer that holds the data to be sent, as well as the virtual address of the remote data buffer that will be written to.
  • The local HCA queries its VA->PA mapping table to obtain the physical address of the data to be sent, then fetches the data from memory, assembles the data packet, and sends it out.
  • The remote HCA receives the packet and parses the destination VA from it.
  • The remote HCA uses the VA->PA mapping table stored in its local memory to find the real physical address, verifies the permissions, and then stores the data in memory.

To emphasize once again: for the right-side node, neither the address translation nor the write to memory requires any involvement of its CPU.
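The table lookup in the steps above can be sketched in plain C. This is only a software model for illustration, not how any particular HCA stores its table: the struct `mr_table`, the function `mr_translate`, and the 4 KiB page size are all assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MR_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical model of an MR's translation table: one physical page
 * address per virtual page covered by the registered buffer. */
struct mr_table {
    uint64_t va_base;         /* page-aligned start VA of the MR */
    size_t num_pages;         /* number of pages in the MR */
    const uint64_t *page_pa;  /* page_pa[i] = PA of the i-th page */
};

/* Translate a VA inside the MR to a PA.
 * Returns 0 on success, -1 if the VA is outside the registered region. */
int mr_translate(const struct mr_table *t, uint64_t va, uint64_t *pa)
{
    assert(t->va_base % MR_PAGE_SIZE == 0);
    if (va < t->va_base)
        return -1;
    uint64_t off = va - t->va_base;      /* byte offset into the MR */
    uint64_t idx = off / MR_PAGE_SIZE;   /* which page of the MR */
    if (idx >= t->num_pages)
        return -1;
    *pa = t->page_pa[idx] + off % MR_PAGE_SIZE;
    return 0;
}
```

Note that contiguous virtual pages may map to scattered physical pages, which is why the model keeps one entry per page rather than a single base offset.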

# 2. MR can control HCA’s access to memory permissions

Because the memory address accessed by the HCA comes from the user, if the user provides an illegal address (such as system memory or memory used by another process), HCA reading or writing to it may cause information leakage or memory corruption. Therefore, we need a mechanism to ensure that HCA can only access authorized and safe memory addresses. In the IB protocol, during the preparation stage for data interaction, the APP needs to perform the action of registering MR.

When a user registers an MR, two keys are generated: L_Key (Local Key) and R_Key (Remote Key). Although they are called keys, each one is actually just a number. They are used to enforce access permissions for the local and remote memory regions, respectively. The following two diagrams illustrate the functions of L_Key and R_Key:

[Figure 6_4-2024-04-04: L_Key]

[Figure 6_5-2024-04-04: R_Key]
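As a rough illustration of the R_Key check, the following C sketch models the decision the remote HCA makes before accepting an RDMA WRITE. It is a simplified model under stated assumptions: the struct `mr_entry`, the `MR_*` flags, and `remote_write_allowed` are hypothetical names, and real hardware looks the MR up by its key instead of comparing against a single entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical access flags, loosely mirroring the IBV_ACCESS_* flags */
enum { MR_LOCAL_WRITE = 1, MR_REMOTE_WRITE = 2, MR_REMOTE_READ = 4 };

struct mr_entry {
    uint64_t va;      /* start VA of the registered region */
    uint64_t len;     /* length of the region in bytes */
    uint32_t rkey;    /* key handed out to the remote peer */
    unsigned access;  /* permissions chosen at registration time */
};

/* Decide whether an incoming RDMA WRITE may touch [va, va + len). */
bool remote_write_allowed(const struct mr_entry *mr, uint32_t rkey,
                          uint64_t va, uint64_t len)
{
    assert(mr != NULL);
    if (rkey != mr->rkey)
        return false;                     /* wrong key */
    if (!(mr->access & MR_REMOTE_WRITE))
        return false;                     /* MR not registered for remote write */
    if (va < mr->va || len > mr->len)
        return false;                     /* starts before the MR or is too long */
    return va - mr->va <= mr->len - len;  /* overflow-safe end-of-range check */
}
```

A request is accepted only when the key matches, the registration granted remote write access, and the whole target range lies inside the region.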

At this point you may be wondering: how does the local end know the available VA and the corresponding R_Key of the peer node? In fact, before the actual RDMA communication starts, the two nodes establish a link by some other means (it could be a Socket connection or a CM connection) and exchange the information needed for RDMA communication (VA, Key, QPN, etc.) over this link. We call this process “link establishment” or “handshake.” I will introduce it in detail in later articles.
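To make the exchanged information concrete, here is a minimal sketch of a handshake payload that could be sent over a Socket. The struct `conn_info`, its 16-byte big-endian wire layout, and the pack/unpack helpers are assumptions made for this example; the article does not prescribe any particular format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handshake payload: what one side tells the other. */
struct conn_info {
    uint64_t va;    /* address of the registered buffer */
    uint32_t rkey;  /* R_Key of the MR */
    uint32_t qpn;   /* queue pair number */
};

#define CONN_INFO_WIRE_SIZE 16

/* Encode as big-endian bytes so hosts of either endianness agree. */
void conn_info_pack(const struct conn_info *ci, uint8_t out[CONN_INFO_WIRE_SIZE])
{
    assert(ci != NULL && out != NULL);
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(ci->va >> (56 - 8 * i));
    for (int i = 0; i < 4; i++) {
        out[8 + i]  = (uint8_t)(ci->rkey >> (24 - 8 * i));
        out[12 + i] = (uint8_t)(ci->qpn >> (24 - 8 * i));
    }
}

/* Decode the 16 bytes produced by conn_info_pack. */
void conn_info_unpack(const uint8_t in[CONN_INFO_WIRE_SIZE], struct conn_info *ci)
{
    assert(ci != NULL && in != NULL);
    ci->va = 0;
    ci->rkey = 0;
    ci->qpn = 0;
    for (int i = 0; i < 8; i++)
        ci->va = (ci->va << 8) | in[i];
    for (int i = 0; i < 4; i++) {
        ci->rkey = (ci->rkey << 8) | in[8 + i];
        ci->qpn  = (ci->qpn << 8) | in[12 + i];
    }
}
```

Each side would pack its own values, send the 16 bytes over the already-established link, and unpack what it receives from the peer.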

In addition to the two points above, registering MR has another important function:

# 3. MR can avoid page swapping

Because physical memory is limited, the operating system uses a swapping mechanism to temporarily save a process’s unused memory contents to the hard drive. When the process needs them again, a page fault is used to bring the contents from the hard drive back into memory, and this process almost inevitably causes the VA-PA mapping relationship to change.

Since HCA often bypasses the CPU to read and write to the physical memory areas pointed to by the VA provided by the user, if the VA-PA mapping relationship changes, then the VA->PA mapping table mentioned earlier will lose its significance, and HCA will be unable to find the correct physical address.

To prevent the VA-PA mapping relationship from changing due to page swapping, the memory is “pinned” when the MR is registered (also known as “page locking”), which means the VA-PA mapping relationship is locked. In other words, the MR’s memory stays in physical memory without being swapped out until communication is complete and the user explicitly deregisters the MR.
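For a feel of what page locking means from user space, the sketch below uses the POSIX `mlock` call as an analogy. This is an assumption made for illustration: MR registration does its pinning inside the kernel driver and does not go through `mlock`, and `mlock` alone only keeps pages resident, whereas pinning for an MR also has to keep the physical pages from being moved. The helper names `pin_buffer` and `unpin_buffer` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Lock [buf, buf + len) into RAM so it cannot be swapped out.
 * Returns 0 on success or a negative errno value on failure
 * (for example when the locked-memory limit is exceeded). */
int pin_buffer(void *buf, size_t len)
{
    assert(buf != NULL);
    return mlock(buf, len) == 0 ? 0 : -errno;
}

/* Undo pin_buffer so the pages may be swapped out again. */
int unpin_buffer(void *buf, size_t len)
{
    assert(buf != NULL);
    return munlock(buf, len) == 0 ? 0 : -errno;
}
```

Like `mlock`, registering large MRs can fail when the process’s locked-memory limit (`RLIMIT_MEMLOCK`) is too low, so that limit is worth checking when registration returns an error.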

Alright, we have now finished introducing the concept and function of MR. In the next article, I will introduce the concept of PD (Protection Domain).

# Code example

Below is a simple RDMA program demonstrating how to register MR:

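```c
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
    // Minimal initialization: open the first RDMA device and allocate a PD
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx = (devs && devs[0]) ? ibv_open_device(devs[0]) : NULL;
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    static char buf[1024];
    if (!pd) return 1;

    // Register buf as an MR; REMOTE_WRITE also requires LOCAL_WRITE
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf), IBV_ACCESS_LOCAL_WRITE |
                                                         IBV_ACCESS_REMOTE_WRITE);
    if (!mr) return 1;
    // Get L_Key and R_Key
    uint32_t lkey = mr->lkey;
    uint32_t rkey = mr->rkey;
    printf("lkey=0x%x rkey=0x%x\n", lkey, rkey);
    // Omit other code (QP setup, exchanging VA/R_Key, posting WQEs)...
    ibv_dereg_mr(mr);  // deregistering also unpins the memory
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```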