
What Is Service Telemetry?

Service telemetry measures network latency for remote direct memory access (RDMA) services and visualizes the results online. It applies to IPv4 RDMA over Converged Ethernet version 2 (RoCEv2) packets and provides two capabilities: input/output (I/O) quality visualization and throughput visualization. For a storage I/O operation, it breaks the latency down by segment into network, storage, and compute nodes. For RoCEv2 packets, it measures the transmission duration, effective throughput, and retransmission rate. Together, these measurements help monitor the network and demarcate problems.

Why Do We Need Service Telemetry?

As the intelligence era begins, a growing number of services have massive data storage and read/write requirements. RDMA services pose the following operations and maintenance (O&M) challenges:

  1. The network cannot proactively detect service performance deterioration or fluctuation caused by problems such as congestion. Instead, network faults are usually reported by the service department.
  2. When the storage I/O latency or input/output operations per second (IOPS) deteriorates, it is difficult to locate the fault.
  3. The NPU throughput cannot be measured due to the minute-level collection precision of interface statistics.
  4. The priority-based flow control (PFC) statistics reflect neither the degree of network congestion nor its impact on throughput.
  5. It is difficult to detect and locate network interface card (NIC) problems and silent packet loss.
  6. Troubleshooting takes a long time because scenario-based best practices are lacking.

To address these challenges, Huawei has launched the service telemetry technology. It overcomes the limitations of traditional network monitoring by providing RDMA-based I/O quality visualization and throughput visualization. By accurately monitoring and analyzing I/O latency and throughput data, it quickly detects storage service performance deterioration and network congestion. This helps identify network problems and optimize network quality, which in turn supports the development of intelligent lossless networks.

How Does Service Telemetry Work?

I/O Quality Visualization

Service process

The following figure shows the layers involved in the service process of service telemetry.

Working process of service telemetry
  1. Analysis presentation layer (iMaster NCE-FabricInsight): Displays I/O-based performance indicators of service traffic and delivers configurations to devices through NETCONF interfaces.
  2. Device measurement layer (switches):
    • Compute-side port: Service packets enter or leave a measurement device through the compute-side port. The measurement device identifies specified packets, performs I/O latency measurement and breakdown, and reports the measurement result to the analyzer.
    • Storage-side port: Service packets enter or leave a measurement device through the storage-side port. The measurement device identifies specified packets, performs I/O latency measurement and breakdown, and reports the measurement result to the analyzer.

Latency breakdown solution

Based on the I/O interaction process, service telemetry can be used to match specified packets in transmit and return directions, define I/O latency breakdown objects, and measure the I/O latency. The following figure shows the latency breakdown solution.

Packet interaction in read and write I/Os
In the preceding figure:
  • Data access latency (DAL): Used to locate problems on the storage side. DALs in read and write operations are measured separately.
  • Data preparation latency (DPL): Used to locate problems on the compute side. The DPL is only involved in the write operation.
  • I/O latency (IOL): Total latency on the compute/storage side.
  • Network round-trip time (RTT): The RTT differs between read and write operations. iMaster NCE-FabricInsight calculates the network RTT using the following formula: RTT = IOL1 – IOL2, where IOL1 and IOL2 are the two I/O latencies marked in the figure.
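
The RTT formula above can be illustrated with a minimal Python sketch. The record layout and field names are hypothetical, and the sketch assumes that IOL1 is the I/O latency measured at the compute-side port and IOL2 the I/O latency measured at the storage-side port, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class IoLatencySample:
    """Latency figures for one I/O, in microseconds (hypothetical layout)."""
    iol1_us: float  # assumption: IOL measured at the compute-side port
    iol2_us: float  # assumption: IOL measured at the storage-side port


def network_rtt_us(sample: IoLatencySample) -> float:
    """Apply RTT = IOL1 - IOL2 to a single I/O."""
    rtt = sample.iol1_us - sample.iol2_us
    if rtt < 0:
        # A negative result suggests the two values belong to different I/Os.
        raise ValueError("IOL2 exceeds IOL1; check that both values describe the same I/O")
    return rtt


sample = IoLatencySample(iol1_us=850.0, iol2_us=790.0)
print(network_rtt_us(sample))  # 60.0
```

Because each IOL is a duration measured locally on one device, the subtraction does not depend on clock synchronization between the two measurement points.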

Throughput Visualization

Service process

The following figure shows the layers involved in the service process of throughput visualization.

Throughput visualization system model
  1. Analysis presentation layer (iMaster NCE-FabricInsight): Displays throughput performance of service traffic and delivers configurations to devices through NETCONF interfaces.
  2. Device service measurement layer (switches): Service packets are exchanged between Server A and Server B through the measurement devices. After throughput visualization is enabled, Device A or Device B can identify RoCEv2 packets, measure the throughput visualization indicators (time required for a single RDMA transmission, effective throughput of RDMA transmission, and ratio of retransmissions initiated by RDMA), and report the measurement result to the analyzer.
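
As an illustration of what identifying RoCEv2 packets can involve, the following Python sketch checks whether an IPv4 packet carries RoCEv2 and extracts a few Base Transport Header (BTH) fields. It relies on two general protocol facts: RoCEv2 runs over UDP destination port 4791, and the BTH is 12 bytes long. The function and type names are hypothetical, and the sketch does not describe how the switches implement identification internally.

```python
import struct
from typing import NamedTuple, Optional

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2


class BthFields(NamedTuple):
    opcode: int   # BTH byte 0
    dest_qp: int  # BTH bytes 5-7, destination queue pair
    psn: int      # BTH bytes 9-11, packet sequence number


def parse_rocev2(ipv4_packet: bytes) -> Optional[BthFields]:
    """Return BTH fields if the IPv4 packet carries RoCEv2, else None."""
    if len(ipv4_packet) < 20 or ipv4_packet[0] >> 4 != 4:
        return None  # too short or not IPv4
    ihl = (ipv4_packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or ipv4_packet[9] != 17:
        return None  # malformed header or not UDP
    udp = ipv4_packet[ihl:]
    if len(udp) < 8 + 12:
        return None  # no room for UDP header plus BTH
    _src_port, dst_port = struct.unpack("!HH", udp[:4])
    if dst_port != ROCEV2_UDP_PORT:
        return None
    bth = udp[8:20]
    return BthFields(
        opcode=bth[0],
        dest_qp=int.from_bytes(bth[5:8], "big"),
        psn=int.from_bytes(bth[9:12], "big"),
    )


# Hand-built sample: IPv4 header + UDP header + BTH (payload and ICRC omitted).
ip_header = bytes([0x45, 0, 0, 40, 0, 0, 0, 0, 64, 17, 0, 0,
                   10, 0, 0, 1, 10, 0, 0, 2])
udp_header = struct.pack("!HHHH", 49152, ROCEV2_UDP_PORT, 20, 0)
bth = bytes([0x0A, 0x00, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x12, 0x80, 0x00, 0x00, 0x2A])
print(parse_rocev2(ip_header + udp_header + bth))
# BthFields(opcode=10, dest_qp=18, psn=42)
```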

Throughput monitoring solution

The following figure shows the packet exchange process of a single RDMA transmission, where the sender sends RoCEv2 packets to the receiver through Device.

Packet exchange process

Throughput visualization can be used to analyze the following indicators:

  1. Flow Completion Time (FCT), in microseconds: FCT = Time when the last data packet is received by Device – Time when the first data packet is received by Device.
  2. Flow Effective Throughput (FET), indicating the effective throughput of RDMA transmission per second: FET (bit/s) = Effective throughput (bit)/FCT (microsecond) × 10^6.
  3. Flow NAK Rate (FNR), indicating the ratio of retransmissions initiated by RDMA: FNR = Number of retransmitted NAK packets/Number of RDMA messages (excluding retransmitted packets).
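
The three indicators can be worked through in a short Python sketch. The function names, parameter names, and sample values are illustrative assumptions; only the formulas come from the list above.

```python
def flow_completion_time_us(first_packet_us: float, last_packet_us: float) -> float:
    """FCT = arrival time of the last data packet - arrival time of the first."""
    return last_packet_us - first_packet_us


def flow_effective_throughput_bps(effective_bits: int, fct_us: float) -> float:
    """FET (bit/s) = effective bits / FCT (microseconds) x 10^6."""
    if fct_us <= 0:
        raise ValueError("FCT must be positive")
    return effective_bits / fct_us * 1e6


def flow_nak_rate(nak_packets: int, rdma_messages: int) -> float:
    """FNR = NAK packets / RDMA messages (retransmitted packets excluded)."""
    if rdma_messages <= 0:
        raise ValueError("at least one RDMA message is required")
    return nak_packets / rdma_messages


# Illustrative values: 8,000,000 effective bits seen between t = 1000 us and t = 1800 us.
fct = flow_completion_time_us(first_packet_us=1000.0, last_packet_us=1800.0)
print(fct)                                            # 800.0
print(flow_effective_throughput_bps(8_000_000, fct))  # 10000000000.0 (10 Gbit/s)
print(flow_nak_rate(nak_packets=2, rdma_messages=400))  # 0.005
```

The factor of 10^6 converts the FCT from microseconds to seconds, so the result is expressed in bits per second.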

Typical Application Scenario of Service Telemetry

The following figure shows the typical application scenario of service telemetry. The service telemetry function can be enabled on switch ports. This function is deployed on the ports connecting to compute-side and storage-side servers and does not need to be deployed on the interconnection ports between switches.

Typical application scenario of service telemetry

The following comparison shows the two modes commonly used in service deployments.

  1. Routine monitoring mode
    • Deployment position: Single-point measurement (compute-side port)
    • Solution: Single-point measurement + port-based polling. Port-based polling limits the number of packets sent to the CPU.
    • Service indicator:
      • Network RTT measurement: not supported
      • IOL, DPL, and DAL measurement on the compute leaf node: In this mode, the DAL includes both the processing latency on the storage side and the network latency. If the DAL is abnormal, a fault on the storage side is suspected.
    • Applicable scenario: Full-flow monitoring (time division multiplexing by interface group, full-flow instead of full-packet)
  2. Maintenance or key assurance mode
    • Deployment position: Multi-point coordinated measurement (compute-side and storage-side ports)
    • Solution: Multi-point measurement + interesting flow. Restricting measurement to interesting flows reduces the number of flows and therefore limits the number of packets sent to the CPU.
    • Service indicator:
      • Network RTT measurement: supported
      • IOL and DPL measurement on the compute leaf node and DAL measurement on the storage leaf node: The measurement result is more accurate.
    • Applicable scenario: Full-process monitoring of interesting flows (full packets of interesting flows)
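
The idea behind port-based polling in routine monitoring mode, where interface groups take turns being measured, can be sketched in Python as a simple round-robin schedule. This is only an illustration of time division multiplexing by interface group. The port names, the grouping, and the function are hypothetical, and the actual scheduling on the switch is not described in this article.

```python
from itertools import cycle, islice
from typing import Iterator, Sequence


def polling_schedule(port_groups: Sequence[Sequence[str]]) -> Iterator[Sequence[str]]:
    """Yield, slot after slot, the single port group measured in that slot."""
    if not port_groups:
        raise ValueError("at least one port group is required")
    # Only one group is measured per time slot, which bounds the number of
    # packets sent to the CPU at any moment.
    return cycle(port_groups)


# Hypothetical compute-side ports split into two groups.
groups = [["port1", "port2"], ["port3", "port4"]]
for slot, group in enumerate(islice(polling_schedule(groups), 4)):
    print(slot, group)
# 0 ['port1', 'port2']
# 1 ['port3', 'port4']
# 2 ['port1', 'port2']
# 3 ['port3', 'port4']
```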

About This Topic
  • Author: Qian Jinchen, Yin Rongrong
  • Updated on: 2025-07-07