
What Is AI ECN?

The Artificial Intelligence Explicit Congestion Notification (AI ECN) function intelligently adjusts ECN thresholds of lossless queues based on the traffic model on the live network. This function ensures low delay and high throughput with zero packet loss, achieving optimal performance for lossless services.

How Does AI ECN Differ from ECN?

The congestion control mechanism most widely used on RDMA over Converged Ethernet version 2 (RoCEv2) networks relieves congestion as follows: When a network device detects congestion, it marks packets with ECN and forwards them to the receiving server. The receiving server then sends a Congestion Notification Packet (CNP) to the sending server, instructing it to reduce its packet sending rate.

Congestion control mechanism
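The feedback loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a device or NIC implementation: the class names, the `queue_depth` unit, and the fixed `cut_factor` of 0.5 are all assumptions made for the example (real RoCEv2 senders use more elaborate rate-control algorithms).

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ecn_marked: bool = False  # set by the switch when it detects congestion


class Switch:
    """Hypothetical network device with one lossless queue."""

    def __init__(self, ecn_threshold: int):
        self.ecn_threshold = ecn_threshold
        self.queue_depth = 0  # current buffer usage, same unit as the threshold

    def forward(self, packet: Packet) -> Packet:
        # Mark the packet once buffer usage reaches the ECN threshold.
        if self.queue_depth >= self.ecn_threshold:
            packet.ecn_marked = True
        return packet


class Receiver:
    """Receiving server: answers ECN-marked packets with a CNP."""

    def needs_cnp(self, packet: Packet) -> bool:
        return packet.ecn_marked


class Sender:
    """Sending server: reduces its rate when a CNP arrives."""

    def __init__(self, rate_gbps: float, cut_factor: float = 0.5):
        self.rate_gbps = rate_gbps
        self.cut_factor = cut_factor  # assumed multiplicative decrease

    def on_cnp(self) -> None:
        self.rate_gbps *= self.cut_factor


def send_one(sender: Sender, switch: Switch, receiver: Receiver) -> float:
    """Push one packet through the loop and return the sender's new rate."""
    packet = switch.forward(Packet())
    if receiver.needs_cnp(packet):
        sender.on_cnp()
    return sender.rate_gbps
```

With `ecn_threshold=100`, a packet forwarded while `queue_depth` is 50 leaves the sender's rate unchanged, whereas a packet forwarded while `queue_depth` is 150 comes back as a CNP and the sender's rate is reduced.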

Both AI ECN and ECN use this mechanism for congestion control. However, the traditional ECN function requires ECN thresholds to be configured manually, and an ECN-enabled device detects congestion only when the buffer usage exceeds the configured ECN threshold. For lossless services, manually configured ECN thresholds can neither adapt to the changing buffer space in a queue nor meet the requirements of traffic models with different characteristics on the network.

AI ECN addresses these issues. Leveraging intelligent algorithms, AI ECN for lossless queues enables a device to perform AI training based on the traffic model on the live network and adjust ECN thresholds based on traffic characteristics (such as the queue length). In this way, the lossless queue buffer is accurately managed and controlled, ensuring optimal performance across the entire network.

Why Do We Need AI ECN?

To implement traffic control and relieve buffer congestion, you can configure two types of buffer thresholds for lossless queues: ECN and PFC. When the buffer usage of an outbound queue on a device reaches the ECN threshold, the device instructs the sending server to reduce its packet sending rate. When the buffer usage of an inbound queue on the device reaches the PFC threshold, the device instructs the upstream device to stop sending traffic. Because congestion rarely occurs in the inbound direction as long as there is no congestion in the outbound direction, it is preferable for the ECN threshold to be triggered first when congestion occurs, so that the sending server reduces its packet sending rate. Triggering PFC is best avoided because it may interrupt traffic.

ECN threshold and PFC threshold for relieving congestion
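As a rough sketch of how the two thresholds act on different queues, the following Python function returns the actions a device would take for a given buffer state. The function name, the action labels, and the use of a single number per queue are assumptions made for illustration only.

```python
from typing import List


def congestion_actions(
    egress_usage: int,
    ingress_usage: int,
    ecn_threshold: int,
    pfc_threshold: int,
) -> List[str]:
    """Return the congestion actions triggered by the current buffer usage.

    All four values are assumed to share one unit (for example, KB).
    """
    actions = []
    # The outbound queue reaching the ECN threshold leads to ECN marking,
    # which eventually slows down the sending server.
    if egress_usage >= ecn_threshold:
        actions.append("ecn_mark")
    # The inbound queue reaching the PFC threshold pauses the upstream device.
    if ingress_usage >= pfc_threshold:
        actions.append("pfc_pause")
    return actions
```

In a well-tuned configuration, the first branch is expected to fire well before the second, so that senders slow down before any upstream device has to be paused.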

A proper ECN threshold is especially important for ensuring low delay and high throughput with zero packet loss. However, the size, rate, and buffer usage of traffic on the network are constantly changing. Different types of traffic have different requirements on the ECN threshold. Multiple factors need to be considered when setting the ECN threshold. For example:

  • Interval between the time when congestion is signaled and the time when the sending server reduces its packet sending rate

    There is a delay between the time when the network device marks packets with ECN and the time when the sending server, after receiving the resulting CNP, reduces its packet sending rate. During this period, the server continues to transmit traffic to the device at the original packet sending rate, so congestion in the queue buffer gets worse. This, in turn, can cause PFC to be triggered and transmission to be suspended, affecting services. To minimize the impact on services, PFC should be triggered only when necessary. To this end, ECN thresholds need to be set so that the buffer gap between the ECN and PFC thresholds is large enough to accommodate the traffic sent by the server before it becomes aware of the congestion.

  • A balance between delay-sensitive mice flows and throughput-sensitive elephant flows
    • When high ECN thresholds are set, ECN marking is delayed. This sustains the traffic sending rate and leaves buffer space in the queue for absorbing burst traffic, meeting the bandwidth requirement of throughput-sensitive elephant flows. However, when congestion occurs in a queue, packets accumulate in the buffer, leading to a long queuing delay that hurts delay-sensitive mice flows.
    • When low ECN thresholds are set, ECN marking is triggered early to instruct the server to reduce its packet sending rate. This keeps the buffer depth low, reduces packet queuing, and lowers the queuing delay, which benefits delay-sensitive mice flows. However, a low ECN threshold limits the bandwidth of throughput-sensitive elephant flows and therefore cannot ensure high throughput for them.
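Both considerations can be made concrete with some simple arithmetic. The sketch below is a simplified worked example: it assumes the sender keeps transmitting at a constant rate for the whole reaction delay and ignores traffic drained from the queue in the meantime, so the first function gives a conservative upper bound. The function names and the sample figures (100 Gbit/s, 10 µs) are illustrative assumptions, not values from a real deployment.

```python
def bytes_before_reaction(rate_gbps: float, reaction_delay_us: float) -> float:
    """Bytes a sender transmits before it reacts to the congestion signal.

    1 Gbit/s for 1 microsecond is 1000 bits, hence the factor 1000 / 8.
    The gap between the ECN and PFC thresholds should be at least this large.
    """
    return rate_gbps * reaction_delay_us * 1000 / 8


def queuing_delay_us(queue_depth_bytes: float, drain_rate_gbps: float) -> float:
    """Time a newly arrived packet waits behind `queue_depth_bytes` of traffic."""
    return queue_depth_bytes * 8 / (drain_rate_gbps * 1000)


# Headroom needed between the ECN and PFC thresholds for one sender.
gap_bytes = bytes_before_reaction(rate_gbps=100, reaction_delay_us=10)

# Queuing delay when the queue is filled up to a low or a high ECN threshold.
low_threshold_delay = queuing_delay_us(25_000, drain_rate_gbps=100)
high_threshold_delay = queuing_delay_us(250_000, drain_rate_gbps=100)
```

Under these assumptions, a 100 Gbit/s sender that reacts after 10 µs injects 125,000 bytes in the meantime. A queue held at a 25 KB threshold adds about 2 µs of queuing delay on a 100 Gbit/s port, whereas a 250 KB threshold adds about 20 µs, which illustrates why mice flows favor low thresholds.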

To balance delay-sensitive mice flows and throughput-sensitive elephant flows in complex and changeable traffic scenarios on the live network, AI ECN is introduced. It dynamically predicts network traffic changes based on the traffic model on the live network and adjusts the ECN threshold to the optimal value in real time. In this way, the lossless queue buffer is accurately managed and controlled, ensuring zero packet loss, low delay, and high throughput for RoCEv2 traffic.

How Does AI ECN Work?

AI ECN uses embedded AI (EAI) for intelligent computing. EAI is a general framework built into the device for AI functions. It provides model management, data acquisition, and preprocessing functions for AI ECN, and sends inference results to the AI ECN component. As shown in the figure, the device collects traffic characteristics on the live network and sends them to the AI ECN component. The AI ECN component intelligently sets the optimal ECN thresholds for lossless queues based on the inference result of the EAI system to ensure low delay and high throughput of lossless queues. In this way, optimal performance of lossless services can be achieved in different traffic scenarios.

Implementation of the AI ECN function for lossless queues
  1. The forwarding component on the network device collects traffic characteristics such as the queue buffer usage, bandwidth throughput, and current ECN thresholds, and uses telemetry to push the real-time network traffic status to the AI ECN component.
  2. After the AI ECN function is enabled, the AI ECN component automatically subscribes to EAI system services. After receiving the pushed network traffic status information, the AI ECN component determines the current traffic model and uses the EAI system to identify whether the current network traffic scenario is a known scenario.
    • If the traffic model is a trained model in the EAI system, the AI ECN component determines that the current network traffic scenario is a known scenario. The AI ECN component then calculates the ECN thresholds that match the current network status based on the optimal result inferred by the EAI system. This mode is known as the model inference mode. As it uses the Neural Network (NN) algorithm, it is also referred to as NN mode.
    • If the traffic model is not a trained model in the EAI system, the AI ECN component determines that the current network traffic scenario is an unknown scenario. The AI ECN component then uses the heuristic search algorithm to continuously correct the current ECN thresholds in real time based on the live network status, ensuring high bandwidth and low delay, until optimal ECN thresholds are obtained. This mode is known as the heuristic inference mode. As it uses the Bottleneck Bandwidth and Round-trip propagation time (BBR) algorithm, it is also referred to as BBR mode.
  3. The AI ECN component delivers the optimal ECN threshold configuration to the device, which then adjusts the ECN thresholds of lossless queues accordingly.
  4. The device performs the preceding operations for new traffic status to ensure optimal performance of lossless services.
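
The steps above can be sketched as a small decision loop. The following Python code is only a toy illustration of the control flow (known scenario: use the trained result; unknown scenario: adjust the current threshold step by step). Everything in it is an assumption: the `TrafficStats` fields, the `classify` rules, the lookup table standing in for trained models, and the step-based `heuristic_step` are hypothetical and do not reproduce the actual NN or BBR-based algorithms used by the EAI system.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TrafficStats:
    """Traffic characteristics pushed by the forwarding component (step 1)."""
    queue_usage: float   # fraction of the queue buffer in use, 0.0 to 1.0
    throughput: float    # fraction of the link bandwidth in use, 0.0 to 1.0
    ecn_threshold: int   # current ECN threshold, for example in KB


def classify(stats: TrafficStats) -> str:
    """Toy traffic-model classifier (hypothetical rules)."""
    if stats.throughput >= 0.8:
        return "elephant_heavy"
    if stats.queue_usage <= 0.2:
        return "mice_heavy"
    return "mixed"


def heuristic_step(stats: TrafficStats, step: int = 10,
                   low: int = 10, high: int = 1000) -> int:
    """One correction of the current threshold for an unknown scenario."""
    threshold = stats.ecn_threshold
    if stats.queue_usage > 0.5:
        threshold -= step   # deep queue: mark earlier to cut queuing delay
    elif stats.throughput < 0.9:
        threshold += step   # link underused: mark later to gain throughput
    return max(low, min(high, threshold))


def next_threshold(stats: TrafficStats,
                   trained: Dict[str, int]) -> Tuple[int, str]:
    """Choose the next ECN threshold and report which mode produced it (step 2)."""
    scenario = classify(stats)
    if scenario in trained:
        # Known scenario: use the result associated with the trained model.
        return trained[scenario], "model_inference"
    # Unknown scenario: keep correcting the current threshold.
    return heuristic_step(stats), "heuristic_inference"
```

Calling `next_threshold` once per telemetry update and writing the returned value back to the queue corresponds to steps 3 and 4: the device applies the new threshold and the loop repeats for the next traffic status.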
About This Topic
  • Author: Feng Yuanyuan
  • Updated on: 2024-02-27