
What Is DPFR?

DPFR is a sub-millisecond fault recovery technology. It detects port faults quickly on the data plane and works with local fast fault convergence, remote fault advertisement, and remote fast fault convergence to restore forwarding rapidly and minimize the impact on services. This document describes why DPFR is needed, how it compares with traditional fault convergence technologies, how it works, and a typical application.

Why Do We Need DPFR?

Traditional route convergence relies on dynamic routing protocols (such as OSPF and BGP) exchanging information and recomputing paths on the control plane. Although Bidirectional Forwarding Detection (BFD) accelerates fault detection, route convergence still takes hundreds of milliseconds, and can take seconds on a large-scale data center network (DCN).

Online transaction applications, such as high-performance storage and high-performance database access, demand extremely high performance and reliability. For such services, taking hundreds of milliseconds to restore traffic after a link fault is unacceptable. Continuous packet loss may cause transaction failures or even connection timeouts in the peer protocol stack, and application performance deteriorates significantly as a result.

DPFR was developed to solve this problem. It moves fault convergence from the control plane to the data plane. On the data plane, it performs fast fault detection, remote fault advertisement, and fast path switching, achieving sub-millisecond fault convergence and minimizing the impact on service performance. This provides higher reliability and stability for mission-critical applications such as high-performance databases, storage, and supercomputing.

Comparison Between DPFR and Traditional Fault Convergence Technologies

The following table compares DPFR and traditional fault convergence technologies on a large-scale DCN.

Table 1-1 Comparison between DPFR and traditional fault convergence technologies

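| Item                   | DPFR                     | FRR                      | BFD           | ECMP          |
|------------------------|--------------------------|--------------------------|---------------|---------------|
| Fault detection time   | Hundreds of microseconds | Hundreds of milliseconds | Milliseconds  | Seconds       |
| Fault convergence time | Sub-milliseconds         | Hundreds of milliseconds | Seconds       | Seconds       |
| Fault convergence mode | Data plane               | Control plane            | Control plane | Control plane |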
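
To give a feel for what these orders of magnitude mean in practice, the following sketch estimates how much traffic is sent toward a failed link before convergence completes. The 100 Gbit/s link rate, the full-utilization assumption, and the specific times chosen (0.5 ms, 300 ms, 2 s) are illustrative assumptions picked to represent the ranges in the table; they are not measured values.

```python
def bytes_at_risk(link_rate_gbps: float, convergence_s: float) -> float:
    """Bytes sent toward a failed link during the convergence window,
    assuming the link was carrying traffic at full line rate."""
    bits = link_rate_gbps * 1e9 * convergence_s
    return bits / 8


# Illustrative values only (assumptions, not figures from the table).
LINK_RATE_GBPS = 100.0
WINDOWS = {
    "sub-millisecond (0.5 ms)": 0.5e-3,
    "hundreds of milliseconds (300 ms)": 300e-3,
    "seconds (2 s)": 2.0,
}

for label, seconds in WINDOWS.items():
    megabytes = bytes_at_risk(LINK_RATE_GBPS, seconds) / 1e6
    print(f"{label}: about {megabytes:,.2f} MB at risk")
```

Under these assumptions the exposure grows linearly with convergence time, which is why moving from hundreds of milliseconds to sub-millisecond convergence matters for loss-sensitive workloads.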

How Does DPFR Work?

DPFR involves three roles, whose functions are as follows:

  • Fault detection node
    1. Fast fault detection: If an optical module on a port is faulty or a transmission optical cable is incorrectly connected, the data plane quickly detects the faulty port. It then samples the outgoing traffic on that port, identifies the affected (faulty) flows, and generates the corresponding fault table.
    2. Local fast fault convergence: If a FIB lookup based on the faulty flow information shows that the fault detection node has a redundant path, the data plane switches data packets to that path without waiting for fault convergence on the control plane. (In this case, the fault detection node also acts as the path switching node.)
    3. Remote fault advertisement: If the FIB lookup shows that the fault detection node has no redundant path, the data plane generates a fault advertisement packet and sends it to the upstream device.
  • Forwarding node
    1. Remote fault advertisement reception: The forwarding node records the port through which the fault advertisement packet is received, determines information about the faulty flow, and generates the corresponding fault table.
    2. Remote fault relay: If a FIB lookup based on the faulty flow information shows that the forwarding node has no redundant path, the node samples the faulty flow on the port, generates a fault advertisement packet, and sends it to the upstream device.
  • Path switching node
    1. Remote fault relay reception: This process is the same as that of remote fault advertisement reception.
    2. Remote fast fault convergence: This process is the same as that of local fast fault convergence.
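
The division of work among the three roles can be sketched as a single decision applied at each hop: record the fault, then either switch to a redundant path or pass the fault upstream. The following Python model is a minimal illustration, not the device implementation. The `Node` class, the shape of its FIB and fault table, and the return strings are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a DPFR-capable switch."""
    name: str
    # FIB: destination prefix -> candidate egress ports (for example, ECMP members).
    fib: dict[str, list[str]]
    # Fault table: (destination, port) pairs currently considered unusable.
    fault_table: set[tuple[str, str]] = field(default_factory=set)

    def usable_ports(self, dest: str) -> list[str]:
        """Egress ports for dest that are not recorded in the fault table."""
        return [p for p in self.fib.get(dest, []) if (dest, p) not in self.fault_table]

    def handle_fault(self, dest: str, bad_port: str) -> str:
        """Record a fault and choose between path switching and advertisement.

        bad_port is the locally failed port (fault detection node) or the
        port on which a fault advertisement arrived (forwarding node or
        path switching node).
        """
        self.fault_table.add((dest, bad_port))
        alternatives = self.usable_ports(dest)
        if alternatives:
            return f"switch:{alternatives[0]}"
        return "advertise-upstream"


dest = "10.0.2.0/24"
detector = Node("A", {dest: ["p1"]})             # single path to dest
relay = Node("B", {dest: ["to_A"]})              # single path via A
switcher = Node("C", {dest: ["to_B", "to_D"]})   # has a redundant path

print(detector.handle_fault(dest, "p1"))    # advertise-upstream
print(relay.handle_fault(dest, "to_A"))     # advertise-upstream
print(switcher.handle_fault(dest, "to_B"))  # switch:to_D
```

In this model the same node plays whichever role its FIB allows: a node with a usable alternative becomes the path switching node, and a node without one relays the fault upstream.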

The preceding nodes generate corresponding fault tables based on the faulty flow information. The entries in the fault tables are aged out after a specified period so that the behavior on the data plane stays consistent with the route convergence result on the control plane.
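
The aging behavior can be illustrated with a small time-to-live table. This is a sketch under assumptions: the `FaultTable` class, the `ttl_s` parameter, and the injectable clock are hypothetical, and the actual aging period and mechanism on a device may differ.

```python
import time


class FaultTable:
    """Hypothetical fault table whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock   # injectable so the example can control time
        self._expiry = {}     # flow key -> time at which the entry ages out

    def add(self, flow) -> None:
        """Record a faulty flow, or refresh its aging timer."""
        self._expiry[flow] = self._clock() + self._ttl_s

    def is_faulty(self, flow) -> bool:
        """True while the entry exists and has not aged out."""
        expiry = self._expiry.get(flow)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expiry[flow]  # aged out: fall back to the normal FIB result
            return False
        return True


# Usage with a fake clock so that the example does not need to sleep.
now = [0.0]
table = FaultTable(ttl_s=5.0, clock=lambda: now[0])
flow = ("10.0.2.0/24", "p1")

table.add(flow)
print(table.is_faulty(flow))  # True
now[0] = 5.0
print(table.is_faulty(flow))  # False
```

Once an entry expires, forwarding decisions depend only on the FIB again, which by then is expected to reflect the route convergence result from the control plane.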

Overall function analysis

Typical Application of DPFR

In the traditional Layer 3 networking shown in the following figure, servers are connected using independent IP addresses. Leaf switches are deployed as independent Layer 3 gateways to forward Layer 2 and Layer 3 traffic. Spine switches are deployed as independent Layer 3 devices and are connected to leaf switches to implement ECMP load balancing. This networking mainly applies to lossless scenarios such as high-performance computing (HPC), AI, and storage. In HPC scenarios, for example, a link fault can cause a large number of packets to be lost. Distributed computing tasks then fail to be aggregated and must be restarted, and application performance deteriorates significantly. DPFR shortens the packet loss time and ensures high reliability for core applications that require high performance, such as AI, machine learning, and HPC.

On a network where DPFR is deployed on all devices, if a spine switch or a link between a spine switch and a leaf switch fails, the leaf switch that detects the fault can quickly switch traffic to another ECMP member link. For a faulty link between a spine switch and a leaf switch, the spine switch can also instruct the remote leaf switch to switch traffic to another ECMP member link.
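
The two cases above can be sketched from the point of view of a leaf switch that chooses an ECMP uplink per flow. This is an illustrative model only: the `Leaf` class, its method names, and the use of CRC32 as the flow hash are assumptions made for this example, since real ECMP hashing and DPFR signaling are device-specific.

```python
import zlib


def pick_member(flow_id: str, members: list[str], faulty: set[str]) -> str:
    """Hash a flow onto one of the ECMP members that is not marked faulty."""
    healthy = [m for m in members if m not in faulty]
    if not healthy:
        raise RuntimeError("no healthy ECMP member left")
    return healthy[zlib.crc32(flow_id.encode()) % len(healthy)]


class Leaf:
    """Hypothetical leaf switch with one uplink per spine."""

    def __init__(self, uplinks: list[str]):
        self.uplinks = list(uplinks)
        self.local_faulty: set[str] = set()              # uplinks that failed locally
        self.remote_faulty: dict[str, set[str]] = {}     # dest leaf -> uplinks to avoid

    def local_fault(self, uplink: str) -> None:
        """The leaf itself detected a failed uplink (affects all destinations)."""
        self.local_faulty.add(uplink)

    def remote_fault(self, dest_leaf: str, uplink: str) -> None:
        """A spine advertised that it can no longer reach dest_leaf."""
        self.remote_faulty.setdefault(dest_leaf, set()).add(uplink)

    def next_hop(self, flow_id: str, dest_leaf: str) -> str:
        avoid = self.local_faulty | self.remote_faulty.get(dest_leaf, set())
        return pick_member(flow_id, self.uplinks, avoid)


leaf1 = Leaf(["spine1", "spine2"])
# Remote case: the spine1-to-leaf2 link fails and spine1 notifies leaf1.
leaf1.remote_fault("leaf2", "spine1")
print(leaf1.next_hop("flow-a", "leaf2"))  # spine2
```

Keeping remote faults per destination means that traffic toward other leaf switches can continue to use the spine whose link to one leaf has failed.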

Traditional Layer 3 networking
About This Topic
  • Author: Yang Xiaoli
  • Updated on: 2024-02-27