What Is NSLB?
AI training produces few flows, but each flow carries a huge volume of data. Under these conditions, the conventional hash-based load balancing algorithm is extremely prone to load imbalance, which lowers training efficiency. To address this issue, Huawei has launched the Network Scale Load Balance (NSLB) algorithm. This algorithm achieves network-wide load balancing through intelligent optimization, ensuring high network throughput and unleashing high computing power in the AI era.
Why Do We Need NSLB?
In the Parallel Computing Modes of AI Foundation Models, the Biggest Challenge for High Performance Is Load Imbalance
Compared with general-purpose computing, AI foundation model training requires more processors to participate in parallel computing. The industry has proposed the following parallel computing modes:
- Data Parallelism (DP): partitions the training dataset so that multiple copies of the model train in parallel. This approach reduces the training duration.
- Pipeline Parallelism (PP): deploys different layers of a model on different GPUs. This can lower the GPU memory requirements of foundation model computing.
- Tensor Parallelism (TP): splits a tensor into N chunks so that each GPU processes only 1/N of the tensor when the memory of a single GPU cannot support foundation model computing. This approach significantly reduces the number of parameters held by each GPU, allowing a larger model to be trained without adding GPUs.
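The 1/N split behind tensor parallelism can be sketched in a few lines. The following is a minimal illustration using plain Python lists in place of real tensors; the names `split_rows` and `num_gpus` are illustrative, not part of any real framework.

```python
# Illustrative sketch of tensor parallelism (TP): a weight matrix is
# split row-wise into N chunks so each of N GPUs holds only 1/N of
# the parameters.

def split_rows(matrix, num_gpus):
    """Partition a matrix (a list of rows) into num_gpus contiguous chunks."""
    rows_per_gpu = len(matrix) // num_gpus
    return [matrix[i * rows_per_gpu:(i + 1) * rows_per_gpu]
            for i in range(num_gpus)]

# A toy 8x4 weight matrix split across 4 GPUs: each GPU holds 2 rows,
# i.e. 1/4 of the parameters.
weights = [[r * 4 + c for c in range(4)] for r in range(8)]
shards = split_rows(weights, 4)
print(len(shards))     # 4 shards, one per GPU
print(len(shards[0]))  # 2 rows each
```

In a real training framework the shards would be device tensors and the split dimension would depend on the layer type, but the memory arithmetic is the same: each GPU stores 1/N of the parameters.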
In foundation model training, multiple parallel modes including DP, PP, and TP are used in combination to make full use of the cluster's computing power. Whichever combination is used, AllReduce collective communication is involved between machines. An AllReduce task consists of multiple point-to-point communication tasks and completes only when all of them have completed. As a result, collective communication is subject to the bucket effect: the completion time of an AllReduce is determined by its slowest point-to-point communication.
According to the bucket effect, if even one link on the network is congested due to unbalanced loads, that link becomes the short stave of the bucket and the collective communication time increases greatly, even if all other links forward traffic smoothly. Current load balancing technology is based on hash randomization and can only approximate balance when there are a large number of flows. Therefore, the key to improving the training efficiency of AI foundation models is to solve the load imbalance problem.
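The bucket effect reduces to a single `max`: an AllReduce finishes only when its slowest point-to-point transfer finishes. The sketch below uses hypothetical transfer times to show how one congested link dominates the collective completion time.

```python
# Bucket effect: AllReduce completion time equals the time of the
# slowest point-to-point transfer. The millisecond values below are
# hypothetical, chosen only to illustrate the effect.

def allreduce_time(p2p_times_ms):
    """An AllReduce completes only when its slowest transfer completes."""
    return max(p2p_times_ms)

balanced  = [10, 10, 10, 10]  # all four transfers on healthy links
congested = [10, 10, 10, 40]  # one transfer crosses a congested link (4x slower)

print(allreduce_time(balanced))   # 10 ms
print(allreduce_time(congested))  # 40 ms: one slow link quadruples the total
```

Three of the four links could forward traffic four times faster, yet the collective sees none of that headroom, which is why per-link balance, not average utilization, determines training efficiency.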
Characteristics of Reduce and AllReduce collective communication
AI Training Features Few Flows with Large Data Volume Per Flow, Making It Prone to Load Imbalance with the Conventional Hash Algorithm
Compared with general-purpose computing, AI training produces a small number of flows, each large in data volume. General-purpose computing typically uses short connections, so each server can carry hundreds of thousands of flows; AI servers use persistent connections, so each GPU carries only hundreds of flows. General-purpose computing mainly produces mice flows (KB to MB), whereas AI servers mainly produce elephant flows (GB).
The conventional equal-cost multi-path (ECMP) mechanism was designed for general-purpose computing workloads that produce large numbers of flows, each small in data volume. In AI training scenarios, ECMP can easily cause load imbalance across links: some links run at maximum throughput while others sit idle.
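The contrast can be simulated directly. The sketch below hashes flows onto four links the way an ECMP-style switch might (the flow identifiers, sizes, and the MD5-based hash are illustrative stand-ins, not a real switch's hash function) and compares the per-link load for many mice flows versus a few elephant flows.

```python
# Why hash-based ECMP balances many small flows well but a few
# elephant flows poorly: each flow is pinned to one of 4 links by a
# hash of its identifier, and the per-link byte totals are compared.
import hashlib

def link_for(flow_id, num_links=4):
    """Deterministically hash a flow identifier onto one link (ECMP-style)."""
    digest = hashlib.md5(flow_id.encode()).digest()
    return digest[0] % num_links

def per_link_bytes(flows, num_links=4):
    """Total bytes landing on each link, given (flow_id, size) pairs."""
    load = [0] * num_links
    for flow_id, size in flows:
        load[link_for(flow_id, num_links)] += size
    return load

# General-purpose computing: 100,000 mice flows of ~1 MB each.
mice = [(f"m{i}", 1) for i in range(100_000)]
# AI training: 8 elephant flows of ~10 GB each (sizes in MB).
elephants = [(f"e{i}", 10_000) for i in range(8)]

print(per_link_bytes(mice))       # roughly even across the 4 links
print(per_link_bytes(elephants))  # likely skewed; some links may get nothing
```

With 100,000 flows the law of large numbers makes the per-link totals nearly equal, but with only 8 flows a single unlucky hash collision stacks two elephants on one link while another link carries none, which is exactly the imbalance the article describes.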
Load imbalance using the conventional hash algorithm
How Does NSLB Work?
To solve the load imbalance problem, Huawei launched the NSLB algorithm. When NSLB works with NPUs, the iMaster NCE controller proactively obtains or parses AI traffic communication relationships from a global perspective, then computes paths and delivers configurations uniformly, eliminating link conflicts across the entire network. When NSLB works with GPUs, the network proactively detects congestion and automatically switches paths to achieve network-wide load balancing.
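To see why a global view helps, consider the simplest possible globally informed placement: assign each flow to the currently least-loaded link. This greedy sketch is not Huawei's actual NSLB algorithm; it only illustrates the difference between placement with a network-wide view and per-switch random hashing.

```python
# Illustrative contrast with hash-based ECMP: a controller that sees
# every flow can place each one on the least-loaded link, avoiding
# conflicts. Greedy placement here is a toy stand-in, NOT the real
# NSLB path-computation algorithm.

def global_placement(flow_sizes, num_links=4):
    """Assign each flow to the least-loaded link so far (greedy, largest first)."""
    load = [0] * num_links
    for size in sorted(flow_sizes, reverse=True):
        idx = load.index(min(load))  # global view: pick the emptiest link
        load[idx] += size
    return load

elephants = [10_000] * 8            # eight 10 GB elephant flows (sizes in MB)
print(global_placement(elephants))  # [20000, 20000, 20000, 20000]: perfectly even
```

Where hashing eight elephants onto four links is likely to leave some links overloaded and others empty, any placement that consults the current link loads spreads them exactly two per link. The real system must additionally track communication relationships, topology, and path conflicts, which is why a controller such as iMaster NCE computes and delivers the paths centrally.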
Implementation of NSLB working with NPUs
Implementation of NSLB working with GPUs
Typical Application of NSLB
In recent years, AI algorithms have entered the era of foundation models with trillions of parameters, and computing power requirements have increased by nearly 100,000 times. Large-scale AI computing requires efficient collaboration among tens of thousands of AI processors, demanding continuous network optimization to improve parallel computing efficiency. In addition, because AI processors are costly, a high-performance network with zero packet loss and high throughput is urgently needed to fully unleash their efficiency. AI training mainly produces elephant flows, which are few in number but large in data volume per flow, so the conventional network is prone to load imbalance. The slowest flow determines network performance: the next round of communication can start only after the slowest flow in the current round reaches its destination.
To address the preceding challenges, Huawei launches the Xinghe Intelligent Computing Network Solution, a new data center network solution that features ultra-large scale, ultra-high throughput, long-term stability, and reliability in the intelligent era. With the exclusive NSLB algorithm, this solution improves the training efficiency by 20% and fully unleashes the AI computing power.
- Author: Zhang Yanlin
- Updated on: 2024-08-30