Ultra Ethernet: An Open High-Speed Fabric for AI Clusters

Large model training is changing the design assumptions behind data center networks. In the past, the network mostly served as the layer that “connected servers”; today, it sits on the critical path that determines whether a GPU cluster can fully utilize its compute capacity. AllReduce, All-to-All, parameter synchronization, expert parallelism, and KV cache transfer all point to the same fact: an AI backend network must deliver high bandwidth, low tail latency, high path utilization, and large-scale operability at the same time.

Ultra Ethernet, or UE, is a new generation of high-performance Ethernet architecture that emerged in this context. It is developed by the Ultra Ethernet Consortium (UEC), and its core transport protocol is Ultra Ethernet Transport (UET). In June 2025, UEC released UEC Specification 1.0, which the consortium positions as a complete Ethernet communication stack for AI and HPC, covering NICs, switches, optical modules, cables, software interfaces, and more.¹

# UE’s Positioning: A System-Level Upgrade for AI/HPC Backend Networks

Ultra Ethernet is not meant to replace all Ethernet. Its first target is the high-performance backend network in AI/HPC clusters, commonly known as the Scale-Out Fabric. In the UE design paper, the specification authors explain that UE 1.0 mainly targets backend networks, especially scenarios with links above 400 Gbps, medium-distance connections, and large message payloads.²

Figure 1: Network types in the UE specification. Local/Scale-Up networks connect CPUs and accelerators inside a node, Backend/Scale-Out networks connect compute devices, and Frontend networks carry traditional data center traffic. UE 1.0 primarily targets backend high-performance networks.

UE’s goal is not to abandon the Ethernet ecosystem. Instead, it adds new transport, congestion control, load balancing, and reliability mechanisms for AI/HPC while preserving Ethernet/IP compatibility as much as possible. The UE design paper states that UE uses IPv4/IPv6-compatible Layer 3 addressing and packet headers. The Fabric Endpoint (FEP) is the logical entity at both ends of the transport layer and can be roughly understood as the UE counterpart of a traditional NIC.²

# RoCEv2’s Historical Burden

The value of RoCEv2 is that it brought RDMA into routable Ethernet. NVIDIA’s documentation also describes RoCE as a protocol that uses RDMA capabilities to provide direct memory-to-memory transfer at the application layer, with hardware handling transport processing and memory placement.³

But RoCEv2 also carries obvious historical baggage. The UE design paper points out that RoCEv2 largely inherits the InfiniBand transport protocol, requiring lossless transport and strict in-order delivery. In converged Ethernet, this usually relies on PFC as the primary mechanism. PFC requires independent traffic classes and larger headroom buffers, and it can cause congestion spreading and head-of-line blocking.²

This is exactly where large model clusters feel the pain. Training traffic is often highly synchronized, bursty, and concentrated. Congestion at one point can propagate upstream through PFC pause frames and eventually cascade into blocking across multiple hops. Meanwhile, traditional ECMP usually selects paths through flow hashing, so packets from the same flow follow the same path. Once large flows collide on the same hash result, some links become congested while others remain idle. UEC’s official blog also describes UET packet spraying as a remedy for ECMP flow collisions.⁴

# UE Protocol Stack Shape

Figure 2: Layered view in the UE specification. UE's largest change is concentrated in the Transport layer. The PHY, Link, and Network layers remain Ethernet/IP-compatible while introducing several optional enhancements.

From a layered perspective, the core of Ultra Ethernet is UET. The UEC 1.0 white paper calls UET the most important deliverable of UEC 1.0: it provides network-to-application-memory and application-memory-to-network data delivery, which is RDMA capability, while introducing mechanisms that differ from existing RDMA protocols.

Figure 3: UET protocol stack and packet structure. UET sits between application-layer interfaces and the IP network layer. Its SES, PDS, CMS, and TSS sublayers carry semantics, packet delivery, congestion management, and transport security respectively. The lower layers still rely on Ethernet data link and physical layers, with optional enhancements such as Link Level Retry and Credit-Based Flow Control.

This figure further clarifies UE’s design boundary: UET is not an isolated “new RDMA protocol.” It is a transport-layer system located between upper-layer communication interfaces such as libfabric and lower-layer IP/Ethernet networks. It places application-side semantics such as Send/Receive and RMA Read/Write in the SES layer; requests, responses, control packets, and loss detection in the PDS layer; window-based sending, optional receiver-side congestion control, and load balancing in the CMS layer; and key handling, replay protection, and other security capabilities in the TSS layer. In other words, UE’s main changes are concentrated in the transport layer, but it does not leave Ethernet/IP. It adds a new transport system for AI/HPC communication patterns on top of the existing network, data link, and physical layers.
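The division of labor described above can be written down as a small lookup table. This is only a reading aid: the sublayer names and responsibilities come from the text, while the dictionary layout and the `sublayer_for` helper are hypothetical and not part of any UE software interface.

```python
# Responsibilities of the four UET sublayers, as listed in the text above.
# The structure and helper below are illustrative, not a UE API.
UET_SUBLAYERS = {
    "SES": ["send/receive", "rma read/write"],
    "PDS": ["requests", "responses", "control packets", "loss detection"],
    "CMS": ["window-based sending", "receiver-side congestion control", "load balancing"],
    "TSS": ["key handling", "replay protection"],
}


def sublayer_for(concern):
    """Return the sublayer that owns a concern, or None if it is not listed."""
    for name, concerns in UET_SUBLAYERS.items():
        if concern in concerns:
            return name
    return None


print(sublayer_for("load balancing"))
print(sublayer_for("replay protection"))
```

Reading the table top to bottom follows a packet's journey from application semantics down to the secured transport payload that is handed to IP.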

The key point of this design is not simply to tune RoCEv2 again. It re-examines several basic assumptions in AI/HPC networks: must the network be strictly lossless? Must packets arrive in order? Can connection state between large-scale endpoints continue to grow? Can congestion only be avoided by pausing links?

UE’s answer is closer to endpoint-network collaboration: the network can be more flexible, and endpoints must be smarter; congestion can be exposed earlier, and recovery must happen faster. Along this line, UET’s core mechanisms can be summarized in three points: out-of-order friendly, congestion aware, and lightweight in connection state.

# Key UET Mechanisms

## Packet Spraying

Packet Spraying is UET’s core mechanism for improving path utilization.

Traditional ECMP usually pins the same flow to one path. UE introduces the Entropy Value (EV) mechanism. A source FEP can choose different EVs for different packets, causing those packets to take different paths in an ECMP network. If ordering is required, it can choose the same EV.²

Tip: EV can be understood as a “perturbation value” used by ECMP hashing. It is not a business ID, nor is it a direct path identifier. By setting different EVs for different packets, the source endpoint makes switches produce different ECMP hash results, distributing packets from the same large flow across multiple equal-cost paths. If the workload needs to preserve ordering, related packets can use the same EV so that they continue to follow the same path.

Therefore, UE’s idea is not “one large flow chooses one path,” but rather “different packets from one large flow can be distributed across multiple paths.” UEC’s official blog summarizes it this way: a UET sender can spray packets across multiple paths toward the destination, avoiding ECMP flow collision and making link load more balanced.⁴
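The effect of per-packet entropy can be illustrated with a toy hash model. The sketch below assumes a switch that picks one of eight equal-cost paths by hashing the addresses, ports, and an entropy value; the hash function, field layout, addresses, and port numbers are all made up for illustration and do not reflect how any real switch or the UE specification computes ECMP hashes.

```python
import hashlib
from collections import Counter

NUM_PATHS = 8  # assumed number of equal-cost paths


def ecmp_path(src, dst, sport, dport, entropy, num_paths=NUM_PATHS):
    """Pick a path by hashing header fields; `entropy` stands in for the EV."""
    key = f"{src}|{dst}|{sport}|{dport}|{entropy}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def path_load(flows, packets_per_flow, spray):
    """Count packets per path, with a fixed EV per flow or a new EV per packet."""
    load = Counter()
    for src, dst, sport, dport in flows:
        for seq in range(packets_per_flow):
            ev = seq if spray else 0
            load[ecmp_path(src, dst, sport, dport, ev)] += 1
    return load


# Four large flows between the same pair of hosts (illustrative values).
flows = [("10.0.0.1", "10.0.1.1", 4000 + i, 5000) for i in range(4)]
pinned = path_load(flows, 1000, spray=False)
sprayed = path_load(flows, 1000, spray=True)

print("fixed EV :", sorted(pinned.items()))
print("per-pkt EV:", sorted(sprayed.items()))
```

With a fixed EV, the four flows can occupy at most four of the eight paths, and fewer if any of them collide. With a per-packet EV, the same traffic spreads over all paths and the busiest path carries far fewer packets.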

One misunderstanding should be avoided: UE does not simply require the receiver to reorder all out-of-order packets. UET defines several transport modes: RUD (Reliable Unordered Delivery), ROD (Reliable Ordered Delivery), UUD (Unreliable Unordered Delivery), and RUDI (Reliable Unordered Delivery for Idempotent operations). RUD is the default efficient and reliable mode for large messages because it allows packets to arrive out of order in the network and supports packet spraying. The UE specification authors also point out that RUD is one of the most efficient reliable transport modes in UET.²
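One way to see why unordered delivery does not need a reorder queue is a receiver that writes each packet straight to its final position. The sketch below assumes every packet carries its index within the message, which is a simplification chosen for this example; the class and method names are hypothetical and not taken from the UE specification.

```python
import random


class UnorderedMessageBuffer:
    """Places packets by index as they arrive; no reorder queue is kept."""

    def __init__(self, total_len, mtu):
        self.buf = bytearray(total_len)
        self.mtu = mtu
        self.num_packets = (total_len + mtu - 1) // mtu
        self.received = set()

    def on_packet(self, index, payload):
        if index in self.received:
            return  # duplicate, for example a retransmission; ignore it
        start = index * self.mtu
        self.buf[start:start + len(payload)] = payload
        self.received.add(index)

    def complete(self):
        return len(self.received) == self.num_packets


# Split a message into packets, deliver them in shuffled order, and reassemble.
message = bytes(range(256)) * 40
mtu = 4096
chunks = [(i, message[off:off + mtu])
          for i, off in enumerate(range(0, len(message), mtu))]
random.Random(0).shuffle(chunks)

rx = UnorderedMessageBuffer(len(message), mtu)
for index, payload in chunks:
    rx.on_packet(index, payload)

print(rx.complete(), bytes(rx.buf) == message)
```

The receiver only tracks which packets have arrived, so its state grows with the number of packets in flight rather than with the amount of data waiting to be reordered.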

## Packet Trimming

Packet Trimming is UET’s mechanism for quickly detecting congestion-related packet loss.

In traditional designs, when a switch runs out of buffer space, it usually has only two choices: drop packets, or pause upstream traffic through mechanisms such as PFC. UE provides a third choice: when a packet would otherwise be dropped because of congestion, a switch that supports this feature can trim off the payload, keep only the necessary headers, and continue forwarding this “trimmed packet” to the destination. After receiving the trimmed packet, the destination knows that the original payload did not arrive successfully and can quickly request retransmission from the source.²

The value of this mechanism is that it turns “implicit loss” into an “explicit signal.” UEC’s official blog also describes packet trimming as an advanced telemetry mechanism: during congestion, a switch truncates a packet instead of directly dropping it, then sends the header and related congestion information to the receiver so that incast congestion can be mitigated more quickly.⁵

Three caveats apply: Packet Trimming is optional; it requires switch support; and it mainly detects congestion loss, so it cannot detect drops caused by link bit errors.²
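The trimming behavior can be sketched with a toy switch queue. The model below is a simplification under stated assumptions: the 64-byte header size, the separate queue for trimmed headers, and the choice to forward trimmed headers ahead of full packets are illustrative and not taken from the UE specification.

```python
from collections import deque
from dataclasses import dataclass

HEADER_BYTES = 64  # assumed size of the header that survives trimming


@dataclass
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False


class TrimmingSwitchQueue:
    """Toy egress queue that trims the payload instead of dropping the packet."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.data = deque()     # full packets
        self.trimmed = deque()  # header-only packets (assumed to always fit)

    def enqueue(self, pkt):
        size = HEADER_BYTES + len(pkt.payload)
        if self.used_bytes + size <= self.capacity_bytes:
            self.data.append(pkt)
            self.used_bytes += size
        else:
            # Buffer is full: keep the header, discard the payload.
            self.trimmed.append(Packet(pkt.seq, b"", trimmed=True))

    def drain(self):
        out = list(self.trimmed) + list(self.data)
        self.trimmed.clear()
        self.data.clear()
        self.used_bytes = 0
        return out


def receive(packets):
    """Split arrivals into delivered sequence numbers and ones to re-request."""
    delivered, retransmit = [], []
    for pkt in packets:
        (retransmit if pkt.trimmed else delivered).append(pkt.seq)
    return delivered, retransmit


queue = TrimmingSwitchQueue(capacity_bytes=3 * (HEADER_BYTES + 1000))
for seq in range(5):
    queue.enqueue(Packet(seq, bytes(1000)))

delivered, retransmit = receive(queue.drain())
print("delivered:", delivered, "retransmit:", retransmit)
```

The receiver learns exactly which sequence numbers lost their payload, so it can request those packets immediately instead of waiting for a timeout.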

## Ephemeral PDC

UE also rethinks the cost of connections.

In the traditional RDMA model, connections, queue pairs (QPs), and resource reservation often bring significant state overhead. In RoCEv2 or InfiniBand, two servers must first establish a QP connection before they can communicate. At scale, this model runs into three problems:

- Persistence: once established, the connection remains in the NIC’s hardware memory until it is explicitly destroyed.
- Scale explosion: in an AI training cluster with tens of thousands or even hundreds of thousands of GPUs, if each process on each node needs to connect to other nodes for All-to-All collective communication, the required number of QPs across the network grows quadratically, on the order of $O(N^2)$.
- NIC memory exhaustion: the high-speed cache on a NIC chip is very limited and cannot hold hundreds of thousands of QP state entries. This can cause frequent QP cache thrashing and make network performance collapse.
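The quadratic growth is easy to check with a short calculation. The sketch below assumes a full mesh in which every endpoint keeps one QP per peer; real deployments may use more QPs per peer or share them, so the numbers are only an order-of-magnitude illustration.

```python
def full_mesh_qps(endpoints, qps_per_peer=1):
    """Return (QPs held by each endpoint, QPs across the whole fabric)."""
    per_endpoint = (endpoints - 1) * qps_per_peer
    total = endpoints * per_endpoint
    return per_endpoint, total


for n in (1_000, 10_000, 100_000):
    per_endpoint, total = full_mesh_qps(n)
    print(f"{n:>7} endpoints: {per_endpoint:>7} QPs each, {total:,} in total")
```

Growing the cluster tenfold multiplies the per-endpoint state by about ten and the fabric-wide total by about one hundred, which is why pre-established connections stop scaling.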

UE introduces ephemeral Packet Delivery Contexts (PDCs) to manage reliable packet delivery from source to destination. The specification explains that a PDC can be created when the first packet arrives, without adding extra first-packet latency.² In simple terms, the core idea is this: instead of reserving hardware connections for all possible targets during initialization, UE dynamically creates a “temporary channel” only at the moment of communication and releases it immediately after transmission finishes.

This is the real meaning behind “short-lived connections” or “0-RTT connection startup.” It does not mean physical latency disappears. It means UET avoids the extra handshake wait in traditional connection models, making reliable transport context creation lighter, faster, and more suitable for large-scale concurrent communication.
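The lifecycle described above, create on first packet and release on completion, can be sketched as a small context table. This is a toy model: the key format, the `total_packets` field, and the immediate release are assumptions made for the example, and the actual PDC establishment and teardown rules in the specification are more involved.

```python
class PdcTable:
    """Toy receiver-side table of delivery contexts created on demand."""

    def __init__(self):
        self.active = {}      # (source, message id) -> set of received packets
        self.created = 0
        self.peak_active = 0

    def on_packet(self, src, msg_id, seq, total_packets):
        """Handle one packet; return True when the message is complete."""
        key = (src, msg_id)
        received = self.active.get(key)
        if received is None:
            # First packet from this source: create the context right away,
            # with no handshake beforehand.
            received = set()
            self.active[key] = received
            self.created += 1
            self.peak_active = max(self.peak_active, len(self.active))
        received.add(seq)
        if len(received) == total_packets:
            del self.active[key]  # release the context once delivery is done
            return True
        return False


table = PdcTable()
for peer in range(1000):          # 1000 peers, each sending one 4-packet message
    for seq in range(4):
        table.on_packet(f"peer-{peer}", 0, seq, total_packets=4)

print(table.created, table.peak_active, len(table.active))
```

Even though 1000 peers were served, the table never held more than one context at a time in this sequential example, in contrast to a model where one connection per peer stays resident.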

## Congestion Control

UE does not rely on a single mechanism to solve congestion. UET’s congestion management subsystem includes congestion control and load balancing. A baseline deployment only requires switches to support ECMP and basic ECN. At the same time, UE can use fast loss detection mechanisms such as packet trimming to improve recovery efficiency.²

UET defines two complementary congestion control algorithms: NSCC (Network Signal-based Congestion Control) and RCCC (Receiver Credit-based Congestion Control). NSCC runs a control loop at the source and adjusts the window based on network signals such as RTT, ECN, and packet loss. RCCC allocates credits from the receiver side and is an optional receiver-driven mechanism.²

UEC’s official blog also explains this point: UET sender congestion control adjusts the window based on RTT, ECN markings, and packet loss; the receiver-side credit mechanism allows the sender to request send permission, with the receiver granting credits to prevent itself from being overwhelmed by incast traffic.⁴
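A sender-side control loop of this kind can be illustrated with a generic window controller. The sketch below is not the NSCC algorithm: it is a plain additive-increase, multiplicative-decrease loop that reacts to the same three signals (RTT, ECN marks, and loss), and every constant, threshold, and name in it is an assumption chosen for the example.

```python
class WindowController:
    """Generic AIMD-style window loop driven by RTT, ECN, and loss signals."""

    def __init__(self, window_bytes, mtu_bytes, min_window_bytes,
                 max_window_bytes, target_rtt_us):
        self.window_bytes = window_bytes
        self.mtu_bytes = mtu_bytes
        self.min_window_bytes = min_window_bytes
        self.max_window_bytes = max_window_bytes
        self.target_rtt_us = target_rtt_us

    def on_ack(self, rtt_us, ecn_marked):
        if ecn_marked or rtt_us > self.target_rtt_us:
            # Congestion signal: shrink the window by one eighth.
            self.window_bytes = max(self.min_window_bytes,
                                    self.window_bytes * 7 // 8)
        else:
            # No congestion signal: grow by one MTU.
            self.window_bytes = min(self.max_window_bytes,
                                    self.window_bytes + self.mtu_bytes)
        return self.window_bytes

    def on_loss(self):
        # Loss is treated as the strongest signal: halve the window.
        self.window_bytes = max(self.min_window_bytes, self.window_bytes // 2)
        return self.window_bytes


cc = WindowController(window_bytes=64_000, mtu_bytes=4096,
                      min_window_bytes=4096, max_window_bytes=1_000_000,
                      target_rtt_us=20)
history = [
    cc.on_ack(rtt_us=10, ecn_marked=False),  # clean ACK
    cc.on_ack(rtt_us=10, ecn_marked=True),   # ECN mark
    cc.on_ack(rtt_us=50, ecn_marked=False),  # RTT above target
    cc.on_loss(),                            # detected loss
]
print(history)
```

The point of the sketch is the shape of the loop: the sender keeps a window, grows it while the network is quiet, and backs off more sharply as the signal gets stronger.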

From this perspective, UE’s congestion control is not about pushing a single algorithm to the limit. It combines network signals, receiver feedback, and fast loss awareness into a closed loop better suited to bursty AI cluster traffic.
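The receiver-driven side of this loop can be sketched in the same spirit. The model below is not RCCC itself: it assumes a receiver that hands out a fixed credit budget per round and splits it evenly among senders with outstanding requests, which is one simple policy picked for illustration. All names are hypothetical.

```python
class CreditReceiver:
    """Toy receiver that grants send credits to requesting senders."""

    def __init__(self, credit_per_round):
        self.credit_per_round = credit_per_round
        self.pending = {}  # sender -> bytes still waiting for credit

    def request(self, sender, nbytes):
        self.pending[sender] = self.pending.get(sender, 0) + nbytes

    def grant_round(self):
        """Split one round of credit evenly among senders with demand."""
        active = [s for s, n in self.pending.items() if n > 0]
        if not active:
            return {}
        share = self.credit_per_round // len(active)
        grants = {}
        for sender in active:
            granted = min(share, self.pending[sender])
            self.pending[sender] -= granted
            grants[sender] = granted
        return grants


rx = CreditReceiver(credit_per_round=2000)
for sender in ("a", "b", "c", "d"):   # four senders converge on one receiver
    rx.request(sender, 1000)

first = rx.grant_round()
second = rx.grant_round()
third = rx.grant_round()
print(first, second, third)
```

However many senders request at once, the bytes admitted in a round never exceed what the receiver decided it can absorb, which is what keeps an incast burst from overwhelming it.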

# Comparing IB, RoCEv2, and UE

The differences become clearer when InfiniBand, RoCEv2, and Ultra Ethernet are compared side by side. IB’s high-performance capabilities include mature congestion control, dynamic routing, SHARP, and other in-network computing capabilities. RoCEv2 brings RDMA into the Ethernet/IP system, but typical lossless deployments rely on PFC and ECN. UE aims to standardize a new transport layer and congestion control mechanisms for AI/HPC within the Ethernet ecosystem.⁶

| Dimension | InfiniBand (IB) | RoCEv2 | Ultra Ethernet (UE / UET) |
| --- | --- | --- | --- |
| Technical positioning | Dedicated high-performance interconnect for HPC/AI | RDMA transport over Ethernet/IP | Enhanced Ethernet communication stack for AI/HPC |
| Ecosystem base | Dedicated IB devices, management, and operations | Reuses Ethernet switches and IP networks | Built on Ethernet/IP compatibility, extending transport, congestion control, and link capabilities |
| RDMA semantics | Native RDMA | Inherits IBTA RDMA semantics | Modern RDMA over Ethernet/IP, centered on UET |
| Delivery model | High-performance, low-latency, mature dedicated fabric | Usually requires lossless or near-lossless Ethernet | Supports best-effort/lossy networks and can also run on lossless networks |
| Flow control | Dedicated congestion control, QoS, virtual lanes, and related capabilities | Typical lossless mode depends on PFC, combined with ECN/CNP | Endpoint-side congestion control as the core, supporting NSCC, optional RCCC, optional CBFC/LLR |
| Load balancing | Supports dynamic/adaptive routing, depending on implementation | Common ECMP flow-based hashing, with large-flow collision risk | Supports packet spraying and uses EV for packet-level multipath distribution |
| Ordering model | Traditionally emphasizes reliable in-order semantics | Strict in-order delivery constraints are obvious | Supports ROD, RUD, UUD, and RUDI, optimized for out-of-order scenarios |
| Loss recovery | Reliability mechanisms inside dedicated fabrics | Packet loss is costly and often avoided through PFC | Supports fast loss detection, with optional packet trimming for precise retransmission |
| Connection state | Mature but relatively state-heavy | Queue Pair and related connection state are relatively heavy | Ephemeral PDC reduces connection state pressure for large-scale endpoints |
| Interoperability | High performance but relatively concentrated ecosystem | Ethernet ecosystem-friendly, but tuning is complex | Aims for open standards, multi-vendor interoperability, and less lock-in |
| Maturity | Mature commercial deployments | Mature large-scale data center deployments | Specification released; ecosystem and products are still maturing rapidly |

# Architectural Benefits

Figure 7: Benefit comparison table from the UEC 1.0 white paper, listing key changes from traditional RDMA networks to Ultra Ethernet, including out-of-order delivery, packet spraying, congestion control, security capabilities, and large-scale endpoint goals.

The most important thing about Ultra Ethernet is not any single feature. It is that UE puts multiple long-standing AI networking problems into one protocol framework.

It uses Packet Spraying to improve path utilization, Packet Trimming to shorten the time needed to detect congestion loss, Ephemeral PDC to reduce connection state overhead, NSCC / RCCC to build endpoint-side closed-loop congestion control, and multiple transport modes to fit the ordering semantics required by different AI and HPC workloads.

UEC officially describes UEC 1.0 as a complete Ethernet communication stack spanning NICs, switches, optical modules, and cables, and emphasizes its goals of open standards, interoperability, and avoiding vendor lock-in.¹

Precisely because UE is trying to solve a system-level problem, its adoption will not depend on one protocol version alone. Baseline UET can make use of existing Ethernet/IP capabilities such as ECMP and ECN as much as possible. To fully exploit packet trimming, CBFC, LLR, other link-level enhancements, end-to-end security, and advanced congestion control, however, NICs, switches, software stacks, operations tools, and interoperability certification must mature together. The UE design paper also notes that UE’s physical and link layers remain Ethernet-compatible while defining several optional extensions to further improve performance in new deployments.²

So instead of viewing UE as an “immediate replacement for RoCEv2,” it is better to see it as a clearer standardization path. It tries to turn capabilities that previously depended on dedicated networks, vendor implementations, and complex tuning experience into common mechanisms within the open Ethernet ecosystem.

# Directions Worth Watching

RoCEv2 brought RDMA into Ethernet data centers, an important historical contribution. But as large model clusters continue to grow, the problems of PFC dependency, flow-based hashing, strict in-order delivery, connection state, and tail latency are amplified further.

The significance of Ultra Ethernet is that it does not continue down the path of “turning Ethernet into a lossless network.” Instead, it accepts the reality of AI/HPC communication: packets may arrive out of order, congestion will happen, large flows may collide, connection state cannot expand without limit, and tail latency directly affects overall training efficiency.

Under this premise, UE’s value is not merely that it is “faster,” but that it is “better matched to the problem.” It tries to make Ethernet more than a channel for carrying RDMA, and instead pushes it toward a high-performance interconnect natively optimized for AI/HPC. For infrastructure builders who want to balance open ecosystems, controllable cost, and high-performance capability, this direction itself deserves serious attention.

# Summary

If InfiniBand represents dedicated high-performance interconnects, and RoCEv2 represents RDMA entering Ethernet, then Ultra Ethernet may represent the next system-level upgrade of high-performance Ethernet.

It is not a simple tweak to RoCEv2, nor a direct copy of InfiniBand. It is more like absorbing proven AI/HPC networking design experience back into the main Ethernet path: multipath utilization, fast congestion feedback, endpoint-side reliable recovery, lightweight connection state, and a more open interoperability ecosystem.
