What Is AFR?
AI foundation model training requires massive data transmission, demanding higher network capacity. Adaptive flow repathing (AFR) is a flow-level load balancing technology that collects and analyzes flow information in real time to adaptively adjust traffic forwarding paths. This implements global network load balancing, maintains high network throughput, and ensures the training efficiency of AI foundation models.
Why Do We Need AFR?
AI foundation models are widely used in various industries. Many enterprises rent computing power for model training, meaning that they need to transmit massive data samples over the WAN to intelligent computing centers. This increases the demand for higher network throughput.
The flows generated during training data and sample transmission are typically elephant flows. Although few in number, these flows feature high bandwidth (up to 10 Gbps per flow) and a long duration. They can lead to uneven network load, traffic congestion, and a significant drop in network throughput.
| Common Flow | Elephant Flow |
|---|---|
| Many in number | Few in number |
| Low bandwidth per flow | High bandwidth per flow |
| Short duration | Long duration |
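The distinction in the table above can be sketched as a simple classifier. The byte and duration thresholds below are illustrative assumptions for this sketch, not values defined by AFR.

```python
from dataclasses import dataclass

# Illustrative thresholds (assumptions, not AFR-defined values):
# a flow counts as an elephant flow if it sustains a high average
# rate over a long lifetime.
ELEPHANT_RATE_BPS = 1_000_000_000   # ~1 Gbps sustained rate
ELEPHANT_DURATION_S = 10            # long-lived flow

@dataclass
class FlowStats:
    flow_id: str
    bytes_sent: int
    duration_s: float

def is_elephant(flow: FlowStats) -> bool:
    """Classify a flow by its average rate and lifetime."""
    if flow.duration_s < ELEPHANT_DURATION_S:
        return False
    avg_rate_bps = flow.bytes_sent * 8 / flow.duration_s
    return avg_rate_bps >= ELEPHANT_RATE_BPS

# A bulk sample-upload flow: ~10 Gbps sustained for 60 s
bulk = FlowStats("sample-upload", bytes_sent=75_000_000_000, duration_s=60)
# A short request/response flow
mouse = FlowStats("api-call", bytes_sent=20_000, duration_s=0.2)

print(is_elephant(bulk))   # True
print(is_elephant(mouse))  # False
```

In practice a device would compute these statistics from sampled flow records rather than complete per-flow counters, but the classification criteria (rate and duration) are the same ones the table describes.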
Traditional carrier networks use SRv6 TE Policies to optimize traffic and achieve load balancing. However, because the system cannot detect or collect per-path traffic statistics in real time, traffic can be allocated only according to predefined weights. This leads to unsatisfactory optimization and uneven network loads, so the network cannot support concurrent training tasks from multiple users, causing long waiting times. In addition, data samples cannot be uploaded promptly, wasting computing resources.
Network traffic congestion caused by elephant flows
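The weight-based allocation described above can be sketched as follows. The paths and weights are illustrative assumptions; the point is that the split is fixed in advance and cannot react to actual congestion.

```python
# Sketch of traditional weight-based traffic splitting (assumption:
# weights are configured in advance and never adapt to real load).
def split_by_weight(total_bps: float, weights: dict) -> dict:
    """Allocate traffic to paths in fixed proportions."""
    total_weight = sum(weights.values())
    return {path: total_bps * w / total_weight for path, w in weights.items()}

# Two paths configured 2:1. The allocation ignores actual congestion,
# so a congested primary path still receives two thirds of the traffic.
print(split_by_weight(30e9, {"primary": 2, "backup": 1}))
# → {'primary': 20000000000.0, 'backup': 10000000000.0}
```

This is exactly the limitation AFR addresses: the weights stay fixed even when real-time load on the two paths diverges.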
To address the preceding issues, Huawei proposes AFR technology.
How Does AFR Work?
Unlike traditional traffic optimization technologies, AFR can identify the traffic volume and perform refined scheduling on data flows. Its working principles are as follows:
1. Elastic maximum throughput path computation: iMaster NCE-IP manages devices across the entire network and uses AI algorithms to plan paths that maximize network throughput and capacity.
2. Intelligent elephant flow identification: After AFR is deployed, the device can accurately identify elephant flows, and it can collect and report flow information to iMaster NCE-IP in real time.
3. Flow-based adaptive optimization: iMaster NCE-IP monitors network load in real time and dynamically adjusts the forwarding paths of service flows based on the load on each path. This implements global load balancing and ensures high network throughput.
If elephant flows cause congestion on some links along a path, iMaster NCE-IP replans the traffic forwarding path and adjusts some traffic to other paths. This implements load balancing among paths and achieves high network throughput, while also ensuring the efficiency of uploading training data and samples in scenarios where multiple users perform training tasks concurrently.
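The adaptive optimization step can be sketched as a greedy rebalancing loop run from the controller's global view. The path model, the 80% congestion threshold, and the headroom-based path choice below are illustrative assumptions for this sketch, not iMaster NCE-IP's actual algorithm.

```python
# Minimal sketch of flow-level adaptive repathing (assumptions: the
# controller holds per-path capacity/load and per-flow rates for the
# elephant flows it has identified).

CONGESTION_THRESHOLD = 0.8  # repath when a path exceeds 80% utilization

def repath_elephants(paths: dict, flows: list) -> dict:
    """paths: {path_id: {"capacity": bps, "load": bps}}
    flows: [{"id": str, "rate": bps, "path": path_id}] (elephants only).
    Moves flows off congested paths onto the path with the most headroom.
    Returns {flow_id: (old_path, new_path)} for every flow moved."""
    moves = {}
    # Consider the largest flows first: moving one big flow changes
    # utilization the most.
    for flow in sorted(flows, key=lambda f: f["rate"], reverse=True):
        src = paths[flow["path"]]
        if src["load"] / src["capacity"] <= CONGESTION_THRESHOLD:
            continue  # current path is healthy; leave the flow alone
        # Pick the path with the lowest utilization after adding this flow.
        best = min(paths, key=lambda p: (paths[p]["load"] + flow["rate"])
                                        / paths[p]["capacity"])
        if best != flow["path"]:
            src["load"] -= flow["rate"]
            paths[best]["load"] += flow["rate"]
            moves[flow["id"]] = (flow["path"], best)
            flow["path"] = best
    return moves

paths = {
    "A": {"capacity": 100e9, "load": 90e9},  # congested (90%)
    "B": {"capacity": 100e9, "load": 20e9},
}
flows = [{"id": "f1", "rate": 10e9, "path": "A"},
         {"id": "f2", "rate": 10e9, "path": "A"}]
print(repath_elephants(paths, flows))  # → {'f1': ('A', 'B')}
```

Note that only f1 is moved: once path A drops back to the threshold, f2 is left in place, which mirrors the goal of balancing load rather than churning every flow.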
Application Scenarios of AFR
As a flow-level load balancing technology that maintains high network throughput, AFR is primarily used in enterprise intelligent data center access scenarios, including data express and training based on local storage and remote computing.
Data express
In data express scenarios, TB-scale sample data needs to be quickly transmitted to intelligent data centers. AFR implements network load balancing through adaptive flow scheduling, raising network capacity utilization to 95% and enabling TB-scale data transmission within minutes.
Training based on local storage and remote computing
In industries that work with sensitive data, it is imperative that data not be transmitted to the storage zone in the intelligent computing center. Therefore, training based on local storage and remote computing is required. AFR works with Huawei's subscriber priority-based flow control (SPFC) technology to achieve a computing efficiency of 97%, fully ensuring the training efficiency of AI foundation models.
- Author: Bai Hehui
- Updated on: 2025-07-09