RDMA Basic Service Types

This article welcomes non-commercial reprints, please indicate the source when reprinting.

Statement: For collection only, for convenient reading
― Savir, Zhihu Column: 5. Basic RDMA Service Types

In the article “3. RDMA Basic Elements”, we mentioned that the basic communication unit of RDMA is the QP. Several communication models are built on top of QPs, and in the field of RDMA we call them “service types”. The IB protocol describes a service type along two dimensions: “reliable” and “connected”.

# Reliable

Reliability in communication refers to ensuring that the sent data packets can be properly received through some mechanisms. In the IB protocol, reliable service is described as follows:

Reliable Service provides a guarantee that messages are delivered from a requester to a responder at most once, in order and without corruption.

That is, a reliable service guarantees that each message is delivered from the requester to the responder at most once, in the order it was sent, and without corruption.

IB ensures reliability through the following three mechanisms:

# Response mechanism

Suppose A sends a data packet to B. How can A know that B has received it? Naturally, B replies to A with an “I have received it” message. In the field of communications, this reply is generally called an acknowledgment packet, or ACK. The reliable service types of the IB protocol use this acknowledgment mechanism to ensure that a packet has been received by the other party. The receiver does not have to reply to every packet individually; it can also acknowledge multiple packets with a single ACK. We will discuss this further later.
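The idea of one ACK covering several packets can be sketched as follows. This is a minimal illustration, not how any real adapter is implemented: it assumes each packet carries an increasing sequence number, and the `pending_window` structure and `apply_coalesced_ack` function are hypothetical names invented for this example.

```c
#include <stdint.h>

/* Hypothetical sender-side bookkeeping: packets sent but not yet acknowledged. */
struct pending_window {
    uint32_t oldest_unacked; /* sequence number of the oldest packet awaiting an ACK */
    uint32_t next_to_send;   /* sequence number the next new packet will carry */
};

/* Apply one ACK carrying acked_seq: every outstanding packet whose sequence
 * number is <= acked_seq is treated as acknowledged. Returns how many packets
 * this single ACK retired. Wrap-around is ignored to keep the sketch short. */
static uint32_t apply_coalesced_ack(struct pending_window *w, uint32_t acked_seq)
{
    if (acked_seq < w->oldest_unacked || acked_seq >= w->next_to_send)
        return 0; /* stale ACK, or ACK for something never sent: retire nothing */

    uint32_t retired = acked_seq - w->oldest_unacked + 1;
    w->oldest_unacked = acked_seq + 1;
    return retired;
}
```

With packets 10 to 14 outstanding, a single ACK for 12 retires packets 10, 11, and 12 at once, and only 13 and 14 remain pending.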

# Data validation mechanism

This is relatively easy to understand. The sender uses a certain algorithm to compute a checksum over the Header and Payload (the actual data to be sent and received) and places it at the end of the data packet. When the receiving end receives the packet, it computes the checksum with the same algorithm and compares it with the checksum carried in the packet. If they do not match, the data contains errors (usually caused by link issues), and the receiving end discards the packet. The IB protocol uses CRC as its checksum; this article does not cover CRC in depth.
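The compute-then-compare flow can be sketched with a generic CRC-32. This is only an illustration of the principle: it uses the common reflected CRC-32 variant over a plain byte buffer, and it does not reproduce which packet fields the IB protocol's CRCs actually cover. The function names are chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 using the reflected polynomial 0xEDB88320
 * (the widely used Ethernet/zlib variant). */
static uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Receiver-side check: recompute the checksum and compare it with the one
 * that arrived at the end of the packet. Returns 1 if they match. */
static int packet_is_intact(const uint8_t *data, size_t len, uint32_t received_crc)
{
    return crc32_compute(data, len) == received_crc;
}
```

The sender would call `crc32_compute` and append the result; the receiver would call `packet_is_intact` and drop the packet when it returns 0.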

# Order-preserving mechanism

In-order delivery means that packets sent earlier over the physical link are received before packets sent later. Some services, such as voice or video, have strict requirements on packet order. The IB protocol defines the PSN (Packet Sequence Number): each packet carries an incrementing number. The PSN can also be used to detect packet loss. For example, if the receiver has received PSN 1 and then receives PSN 3 without having received PSN 2, it concludes that an error occurred during transmission and sends a NAK back to the sender, requesting retransmission of the lost packet.
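The receiver-side PSN check described above can be sketched as follows. The PSN is a 24-bit field, so the comparison is done modulo 2^24. The half-range rule used here to tell "a packet was skipped" from "an old packet arrived again" is a simplification, and the type and function names are hypothetical.

```c
#include <stdint.h>

#define PSN_MASK 0xFFFFFFu /* the PSN is 24 bits wide and wraps at 2^24 */

enum rx_verdict {
    RX_ACCEPT,             /* packet is the one we expected */
    RX_NAK_SEQUENCE_ERROR, /* a later packet arrived first: something was lost */
    RX_DUPLICATE           /* an already-received packet arrived again */
};

struct rx_state {
    uint32_t expected_psn; /* PSN of the next packet the receiver expects */
};

static enum rx_verdict receive_packet(struct rx_state *rx, uint32_t psn)
{
    /* Distance from the expected PSN, in 24-bit modular arithmetic. */
    uint32_t ahead = (psn - rx->expected_psn) & PSN_MASK;

    if (ahead == 0) {
        rx->expected_psn = (rx->expected_psn + 1) & PSN_MASK;
        return RX_ACCEPT;
    }
    if (ahead < (1u << 23))
        return RX_NAK_SEQUENCE_ERROR; /* gap: ask the sender to retransmit */
    return RX_DUPLICATE;              /* behind the expected PSN */
}
```

Starting with `expected_psn = 1`, receiving PSN 1 is accepted, receiving PSN 3 next yields `RX_NAK_SEQUENCE_ERROR`, and the receiver keeps expecting PSN 2.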

An unreliable service has none of the above mechanisms to ensure that packets are received correctly. It is the “just send it out, I don’t care if it is received or not” type of service.

# Connection and Datagram

A connection here is an abstract, logical concept that must be distinguished from a physical connection. Readers familiar with Sockets will already know the idea. A connection is a communication “pipeline”: once the pipeline is established, data sent into one end will reach the other end along it.

There are many definitions of “connection” or “connection-oriented”. Some focus on preserving the order of messages, some emphasize the uniqueness of the message delivery path, some highlight the software and hardware overhead needed to maintain the connection, and some overlap with the concept of reliability. Since this column introduces RDMA technology, let’s look at the description in section 3.2.2 of the IB protocol:

IBA supports both connection-oriented and datagram service. For connected service, each QP is associated with exactly one remote consumer. In this case, the QP context is configured with the identity of the remote consumer’s queue pair. … During the communication establishment process, this and other information is exchanged between the two nodes.

That is, “IBA supports both connection-oriented and datagram-based services. For connection-oriented services, each QP is associated with another remote node. In this case, the QP Context contains the QP information of the remote node. During the process of establishing communication, the two nodes exchange peer information, including the QP that will be used for communication later.”

The QP Context mentioned above (abbreviated as QPC) can be simply understood as a table that records information related to a QP. We know that a QP consists of two queues; besides the queues themselves, information about the QP also has to be recorded somewhere, and that is what this table is for. The information may include the depth of the queues, the queue numbers, and so on. We will elaborate on this later.

This might still be a bit abstract, so let’s walk through an example:

The network cards of nodes A and B, and of nodes A and C, are physically connected. A’s QP2 and B’s QP7 have established a logical connection, as have A’s QP4 and C’s QP2; we can also say they are “bound together.” In the connection service type, each QP is connected to exactly one other QP, meaning that the destination of every WQE issued to that QP is the same. For example, for each WQE issued to A’s QP2, the hardware learns from the QPC that its destination is B’s QP7 and sends the assembled packet to B. B then stores the data according to the RQ WQE posted to QP7. Similarly, for each WQE issued to A’s QP4, A’s hardware knows that the data should be sent to node C’s QP2.

How is a “connection” maintained? Actually, it is just a record inside the QPC. If A’s QP2 wants to disconnect from B’s QP7 and then “connect” to another QP, it only needs to modify the QPC. While two nodes establish a connection, they exchange the QP Numbers that will later be used for data exchange, and each node records the peer’s QP Number in its own QPC.
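To make "a connection is just a record in the QPC" concrete, here is a heavily simplified sketch. The `qp_context` structure and its fields are hypothetical and exist only for this example; a real QPC holds far more state and is managed by the driver and hardware.

```c
#include <stdint.h>

/* Hypothetical, heavily simplified QPC for a connected QP. */
struct qp_context {
    uint32_t local_qpn;   /* this QP's own number */
    uint32_t remote_node; /* hypothetical identifier of the peer node */
    uint32_t remote_qpn;  /* peer QP number learned during connection setup */
};

/* "Connecting" (or re-connecting) only rewrites the peer record in the QPC. */
static void qpc_connect(struct qp_context *qpc,
                        uint32_t remote_node, uint32_t remote_qpn)
{
    qpc->remote_node = remote_node;
    qpc->remote_qpn = remote_qpn;
}

/* For a connected QP, the destination of every WQE comes from the QPC,
 * not from the WQE itself. */
static uint32_t qpc_destination_qpn(const struct qp_context *qpc)
{
    return qpc->remote_qpn;
}
```

Calling `qpc_connect(&qpc, 'B', 7)` on A’s QP2 models the binding to B’s QP7; calling it again with a different peer models "connecting" to another QP. In libibverbs, the peer’s QP number is supplied through the `dest_qp_num` field of `struct ibv_qp_attr` when calling `ibv_modify_qp`.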

A datagram is the opposite of a connection: no “pipeline establishment” step is needed between the sender and receiver. As long as the sender can physically reach the receiver, it can send to any receiving node over any path. The IB protocol defines it as follows:

For datagram service, a QP is not tied to a single remote consumer, but rather information in the WQE identifies the destination. A communication setup process similar to the connection setup process needs to occur with each destination to exchange that information.

“For datagram services, a QP will not be bound to a unique remote node but will specify the destination node through a WQE. Similar to connection-type services, the process of establishing communication requires both ends to exchange peer information, but for datagram services, this exchange process needs to be executed once for each destination node.”

Let’s take an example:

The QP Context of a datagram-type QP does not contain peer information, meaning the QP is not bound to any other QP. Each WQE issued to the hardware through the QP may point to a different destination. For example, the first WQE issued to QP2 of node A instructs the hardware to send data to QP3 of node C, while the next WQE may instruct it to send to QP7 of node B.

As with the connection service type, the two sides must tell each other in advance, by some means during the preparation stage, which remote QPs the local QP can send data to. This is what the statement above means by “this exchange process needs to be executed once for each destination node.”
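The contrast with a connected QP can be sketched like this: the destination travels in each WQE instead of living in the QPC. The structures below are hypothetical simplifications made up for this example, not the real WQE layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical datagram send WQE: the destination is part of the request. */
struct dgram_wqe {
    uint32_t dest_node;  /* hypothetical identifier of the destination node */
    uint32_t dest_qpn;   /* destination QP number, exchanged beforehand */
    const void *payload; /* data to send */
    size_t length;       /* payload length in bytes */
};

struct destination {
    uint32_t node;
    uint32_t qpn;
};

/* For a datagram QP the hardware reads the destination from the WQE itself,
 * so two consecutive WQEs on the same QP may go to different peers. */
static struct destination resolve_destination(const struct dgram_wqe *wqe)
{
    struct destination d = { wqe->dest_node, wqe->dest_qpn };
    return d;
}
```

Two WQEs posted back to back on node A’s QP2 can therefore resolve to node C’s QP3 and node B’s QP7 respectively. In libibverbs, a UD send work request carries this information in the `wr.ud.ah`, `wr.ud.remote_qpn`, and `wr.ud.remote_qkey` fields of `struct ibv_send_wr`.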

# Service type

The two dimensions mentioned above combine in pairs to form the four basic service types of IB:

| | Reliable | Unreliable |
|---|---|---|
| Connection | RC (Reliable Connection) | UC (Unreliable Connection) |
| Datagram | RD (Reliable Datagram) | UD (Unreliable Datagram) |
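The same two-dimensional classification can be written as a small lookup. The enum and function names here are invented for illustration and are not libibverbs identifiers.

```c
#include <stdbool.h>

/* Hypothetical enum for the four basic service types. */
enum service_type { SVC_RC, SVC_UC, SVC_RD, SVC_UD };

/* Combine the two dimensions (reliable?, connected?) into a service type. */
static enum service_type classify_service(bool reliable, bool connected)
{
    if (connected)
        return reliable ? SVC_RC : SVC_UC;
    return reliable ? SVC_RD : SVC_UD;
}

/* Short name of a service type, as used in the table above. */
static const char *service_name(enum service_type type)
{
    switch (type) {
    case SVC_RC: return "RC";
    case SVC_UC: return "UC";
    case SVC_RD: return "RD";
    case SVC_UD: return "UD";
    }
    return "unknown";
}
```

For instance, `service_name(classify_service(true, true))` gives `"RC"`, and `service_name(classify_service(false, false))` gives `"UD"`.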

RC and UD are the most widely used and most fundamental service types. They can be compared to TCP and UDP, respectively, in the transport layer of the TCP/IP protocol stack.

RC is used in scenarios with high requirements for data integrity and reliability. As with TCP, the mechanisms that ensure reliability naturally add overhead. In addition, under RC each node must maintain a dedicated QP for every peer it communicates with. If N nodes all need to communicate with each other, at least N * (N - 1) QPs are required in total, that is, N - 1 on each node. QPs and QPCs occupy network card resources or memory, so when there are many nodes the consumption of storage resources becomes very large.

UD has low hardware overhead and saves storage resources. For example, if N nodes need to communicate with each other, only N QPs need to be created in total, one per node. However, as with UDP, reliability is not guaranteed. Users who want reliability on top of the UD service type must implement their own application-layer reliable transmission mechanism above the IB transport layer.
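The difference in QP counts can be made concrete with a short calculation. It assumes the minimum described above: for RC, one QP per peer on every node; for UD, one QP per node. The function names are chosen for this example.

```c
#include <stdint.h>

/* Minimum total QPs for full-mesh RC communication among n nodes:
 * each node needs one QP for each of its n - 1 peers. */
static uint64_t rc_total_qps(uint64_t n)
{
    return n == 0 ? 0 : n * (n - 1);
}

/* Minimum total QPs for UD among n nodes: one QP per node can reach all peers. */
static uint64_t ud_total_qps(uint64_t n)
{
    return n;
}
```

For 1000 nodes this gives 999000 QPs in total for RC against 1000 for UD, which is why RC’s QP and QPC storage cost grows so quickly with cluster size.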

In addition, there are RD and UC types, as well as more complex service types like XRC (Extended Reliable Connection) and SRD (Scalable Reliable Datagram), which we will describe in detail in the protocol analysis section.

For more information on QP type selection, you can refer to the article Which Queue Pair type to use? on RDMAmojo. Thanks to @sinkinben for pointing it out in the comments section.

# Code example

In RDMA programming, we create a QP with the ibv_create_qp function, and the qp_type field of struct ibv_qp_init_attr specifies the service type of the QP. Below is a simple example:

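struct ibv_qp_init_attr qp_init_attr = {0}; // Zero every field we do not set explicitly
qp_init_attr.qp_type = IBV_QPT_RC; // RC type
qp_init_attr.sq_sig_all = 1; // 1 means each WQE in SQ generates a corresponding CQE
qp_init_attr.send_cq = cq; // Send CQ (cq was created earlier)
qp_init_attr.recv_cq = cq; // Receive CQ
qp_init_attr.cap.max_send_wr = 1024; // Depth of SQ
qp_init_attr.cap.max_recv_wr = 1024; // Depth of RQ (example value, like the SQ depth)
qp_init_attr.cap.max_send_sge = 1; // Scatter/gather entries per send WQE (example value)
qp_init_attr.cap.max_recv_sge = 1; // Scatter/gather entries per receive WQE (example value)

struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr); // pd was allocated earlier
if (qp == NULL) {
    // Creation failed, e.g. the requested capabilities exceed the device limits
}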