
What Is Intelligent Network O&M?

Huawei has launched the intelligent network O&M solution to visualize various O&M data and quickly detect, locate, and rectify faults. Additionally, this solution provides intelligent capabilities such as comprehensive health evaluation and fault prediction to achieve proactive protection based on exception detection and risk prediction, ensuring 24/7 service continuity.

Why Is Intelligent Network O&M Needed?

Digital transformation has become an inevitable trend across industries, accelerated by software technologies such as big data and machine learning. A wide range of services and applications are being migrated to the cloud, and enterprises increasingly use and access the cloud on a daily basis. In the era of software-defined networking (SDN) and cloud computing, the pooling of compute and storage resources makes enterprises' digital transformation easier. However, networks have become more complex than ever, posing great challenges to network O&M, including:

Difficult to perceive services

Conventional networks are operated and maintained based on alarms. However, the number of alarms on the live network keeps growing and is gradually exceeding what conventional, largely manual methods can handle. One possible compromise is to filter out less important alarms, but this leads to an incomplete picture of network health. As technologies such as SDN gain momentum, O&M personnel need to operate and maintain both physical (underlay) and logical (overlay) networks, which cannot be done based on alarms alone.
Using conventional O&M methods, O&M personnel respond reactively to network faults, and cannot predict or prevent faults before they occur.

Difficult to locate faults

Large-scale network management: In cloud computing scenarios, O&M personnel need to manage not only physical devices but also virtual machines (VMs), involving dozens of times more managed NEs than before. In addition, as real-time analysis becomes a must-have, the O&M system needs to collect device data more frequently — often at intervals of milliseconds instead of minutes. This in turn increases the data volume almost 1000-fold. For proactive detection and troubleshooting of issues, the O&M system also needs to analyze and display masses of device data, further increasing the data scale.
Numerous service paths: To provide high reliability and bandwidth, a network is usually designed to forward traffic in load balancing mode, with a hash algorithm distributing traffic across the available paths. However, the number of forwarding paths increases exponentially as the number of network nodes increases, making it hard for administrators to determine the path over which the traffic of a given service is forwarded. Fault locating using conventional methods depends heavily on the experience of O&M personnel and is very time-consuming.
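
The path growth described above can be illustrated with a minimal Python sketch. It assumes a simplified fabric in which every forwarding stage offers a fixed number of equal-cost next hops, and it uses SHA-256 over the flow 5-tuple purely as a stand-in for a device hash algorithm (real devices use their own hash functions). The function names and the sample flow are hypothetical.

```python
import hashlib
from math import prod


def count_equal_cost_paths(next_hops_per_stage):
    """Number of distinct paths when each stage offers the given
    number of equal-cost next hops (simplified fabric model)."""
    return prod(next_hops_per_stage)


def pick_path_index(five_tuple, num_paths):
    """Deterministically map a flow 5-tuple to one path index.
    SHA-256 is an illustrative stand-in, not a device algorithm."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_paths


if __name__ == "__main__":
    # Each extra stage multiplies the number of candidate paths.
    for stages in ([4], [4, 4], [4, 4, 4]):
        print(stages, "->", count_equal_cost_paths(stages), "paths")

    # Hypothetical flow: (src IP, dst IP, protocol, src port, dst port).
    flow = ("10.0.0.1", "10.0.1.9", 6, 49152, 443)
    total = count_equal_cost_paths([4, 4, 4])
    print("flow is pinned to path index", pick_path_index(flow, total))
```

Because the chosen path depends on a hash of the flow fields, an administrator cannot tell which path a service uses just by looking at the topology, which is why per-flow path visibility matters for fault locating.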

Slow fault rectification

The stable operation of networks plays a pivotal role in safeguarding enterprise information and promoting their business success. As a result, even the slightest interruption to a network will cause serious economic loss. As such, it is paramount that network faults are rectified before they cause any damage to enterprise services.
In financial service scenarios, centralized deployment is giving way to distributed deployment, which is more complex. When O&M personnel can only respond reactively, fault locating takes 76 minutes on average, making it hard to ensure service continuity.

To address the preceding challenges, Huawei has launched the intelligent network O&M solution to implement accurate network O&M and continuously improve the manageability and service quality of networks.

What Are the Benefits of Intelligent Network O&M?

Comprehensive Network Health Evaluation, Implementing Real-Time Awareness of Services and Networks

The network health evaluation solution performs systematic network-level evaluation and monitoring, helping O&M personnel gain insights into the entire network and maximize O&M efficiency and service experience. To elaborate, this solution consists of three parts:

Network-level abstraction and modeling: Build a multi-layer evaluation system and periodically collect the status of network devices, protocols, connections, and services.
Comprehensive and intelligent network health evaluation: Build a network object model for each layer, and comprehensively collect network data including logs, performance data, network device configurations, and service flows exchanged between hosts. Use intelligent analysis algorithms to evaluate the health of each layer, dynamically detect exceptions of key indicators such as the device operating status and network capacity, and proactively predict network capacity and traffic risks.
GUI-based real-time visualization: Display the collected data in various forms such as charts in real time, and periodically generate network health evaluation reports, facilitating routine network health check and proactive troubleshooting.
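
As a rough illustration of a multi-layer evaluation system, the following Python sketch combines per-layer scores into one network-level score and reports the weakest layer. The 0 to 100 scale, the weights, and all names are assumptions made for this example; the source does not describe the actual scoring algorithm.

```python
from dataclasses import dataclass


@dataclass
class LayerHealth:
    name: str      # e.g. "device", "protocol", "connection", "service"
    score: float   # assumed scale: 0 (failed) to 100 (healthy)
    weight: float  # assumed relative importance of the layer


def network_health(layers):
    """Return (weighted average score, name of the worst layer)."""
    total_weight = sum(layer.weight for layer in layers)
    if total_weight <= 0:
        raise ValueError("at least one layer with positive weight is required")
    weighted = sum(layer.score * layer.weight for layer in layers) / total_weight
    worst = min(layers, key=lambda layer: layer.score)
    return round(weighted, 1), worst.name


if __name__ == "__main__":
    layers = [
        LayerHealth("device", 95, 1),
        LayerHealth("protocol", 90, 1),
        LayerHealth("connection", 60, 1),
        LayerHealth("service", 85, 2),
    ]
    score, worst = network_health(layers)
    print(f"network health {score}, weakest layer: {worst}")
```

Reporting the weakest layer alongside the aggregate keeps a single degraded layer from being hidden by healthy ones, which is the point of evaluating each layer separately.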

Fast Fault Locating, Achieving Intelligent Diagnosis

It is difficult to locate and rectify faults quickly because the network is large in scale and its configurations are complex and change frequently. Troubleshooting depends heavily on the experience of O&M personnel and is very time-consuming.

The intelligent network O&M solution can quickly locate the root causes of faults.

The in-situ Flow Information Telemetry (iFIT) technology is used to perform E2E hop-by-hop detection on services with poor quality of experience (QoE), that is, services that are not interrupted but deliver a poor user experience. With iFIT, an intelligent network controller collects information hop by hop and accurately locates failure points based on the collected information.
An executable troubleshooting task chain is orchestrated based on different fault patterns, enormous fault cases on the live network, and Huawei O&M expert experience, to shorten the fault locating and demarcation duration. For example, troubleshooting procedures are automatically orchestrated for service connectivity issues, enabling one-click automatic troubleshooting.
ERSPAN flows of devices and telemetry metrics are collected for big data analytics. Combined with AI algorithms, this makes it possible to proactively detect potential faults on fabrics and to identify whether a network or application has group issues. Together, these capabilities enable proactive fault detection and minute-level fault locating and demarcation.
AI algorithms can also be used to learn and infer unknown faults, helping O&M personnel deeply explore the root causes of these faults.
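
The idea of hop-by-hop locating can be sketched as follows. This Python example assumes that each node on the path reports how many packets of a flow it saw and a mean arrival time offset, then flags the segments between adjacent hops where loss or added delay exceeds a threshold. The record fields, node names, and thresholds are hypothetical and do not reflect the actual iFIT data format.

```python
from typing import List, NamedTuple, Tuple


class HopReport(NamedTuple):
    node: str
    packets_seen: int   # packets of the flow counted at this hop
    arrival_ms: float   # mean arrival offset relative to the ingress


def locate_suspect_segments(
    reports: List[HopReport],
    loss_threshold: float = 0.01,
    delay_threshold_ms: float = 5.0,
) -> List[Tuple[str, str, float, float]]:
    """Compare adjacent hops and return the segments whose loss ratio
    or added delay exceeds the (assumed) thresholds."""
    suspects = []
    for up, down in zip(reports, reports[1:]):
        lost = up.packets_seen - down.packets_seen
        loss = lost / up.packets_seen if up.packets_seen else 0.0
        delay = down.arrival_ms - up.arrival_ms
        if loss > loss_threshold or delay > delay_threshold_ms:
            suspects.append((up.node, down.node, round(loss, 4), round(delay, 3)))
    return suspects


if __name__ == "__main__":
    path = [
        HopReport("PE1", 1000, 0.0),
        HopReport("P1", 1000, 1.2),
        HopReport("P2", 950, 2.5),   # 50 packets lost between P1 and P2
        HopReport("PE2", 950, 3.4),
    ]
    for up, down, loss, delay in locate_suspect_segments(path):
        print(f"suspect segment {up} -> {down}: loss {loss:.2%}, delay {delay} ms")
```

Comparing adjacent hops rather than only the two endpoints is what turns an end-to-end symptom into a specific link or node to inspect.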

Automatic Fault Rectification, Ensuring Service Continuity

The intelligent network O&M system draws on a rule engine, intelligence engine, and knowledge graph to perform big data mining and analytics, thereby detecting and locating faults quickly. It can also work with a controller to implement fault rectification or isolation with just one click. This system can present network and service impact analysis for a given fault. Similarly, it can also display the impacts of a fault rectification or isolation plan on networks and services before delivering the plan. This greatly facilitates decision making.
For poor-QoE services, the intelligent network O&M system can automatically adjust service paths to avoid the links or nodes that cause poor quality, implementing automatic service SLA recovery.
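
Path adjustment around a degraded link can be sketched as a shortest-path computation that excludes the links to avoid. The Python example below uses Dijkstra's algorithm over a small undirected topology; the topology, link costs, and function name are assumptions for illustration, and the source does not state which path computation the system actually uses.

```python
import heapq


def shortest_path(links, src, dst, avoid=frozenset()):
    """Dijkstra over undirected links {(a, b): cost}, skipping any
    link listed in `avoid`. Returns (cost, path) or None."""
    graph = {}
    for (a, b), cost in links.items():
        if (a, b) in avoid or (b, a) in avoid:
            continue  # degraded link: leave it out of the graph
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, src, [src])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == dst:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (dist + cost, nxt, path + [nxt]))
    return None  # no path left once the avoided links are removed


if __name__ == "__main__":
    links = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 2, ("C", "D"): 2}
    print("normal path:  ", shortest_path(links, "A", "D"))
    print("avoiding B-D: ", shortest_path(links, "A", "D", avoid={("B", "D")}))
```

Returning None when no alternative exists matters in practice: rerouting is only an option if the topology still has a usable path after the poor-quality links are excluded.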

Architectures of Intelligent Network O&M

Intelligent O&M is typically applied to data center networks and carrier networks. The following describes the architectures of intelligent network O&M in the two scenarios.

Intelligent O&M Architecture of Data Center Networks

The intelligent O&M architecture of data center networks is logically divided into three layers: network layer, control layer, and analysis layer, as shown in the following figure.

Network layer: consists of data center network devices that report mirrored packets, performance data, and logs to the analysis layer for further handling and presentation. The network layer can be considered as a data source of the analysis layer.
Control layer: is built on iMaster NCE-Fabric, an intelligent network management and control system. It interconnects with iMaster NCE-FabricInsight, an intelligent network analysis system, to automatically provision network services during O&M. iMaster NCE-Fabric can also interconnect with a cloud platform in the cloud-network scenario or with a virtual machine management (VMM) server in the network virtualization scenario to orchestrate logical networks and automatically translate and deliver network device configurations. Additionally, iMaster NCE-Fabric provides various other functions such as path detection, network reachability verification, intelligent fault discovery, fault locating, and fault rectification/isolation.
Analysis layer: is built on iMaster NCE-FabricInsight. Based on Huawei's big data platform, iMaster NCE-FabricInsight receives data from network devices through telemetry and uses intelligent algorithms to analyze and display the reported data. It can proactively detect network faults and locate their root causes in minutes, implementing intelligent O&M.

Intelligent O&M architecture of data center networks

Intelligent O&M Architecture of Carrier Networks

The intelligent O&M architecture of carrier networks is logically divided into three layers: data collection layer, data analysis layer, and data presentation layer, as shown in the following figure.

Data collection layer: iMaster NCE-IP — an intelligent network management and control system — delivers subscription messages to network devices, which then send running data, configuration data, and resource data to the data analysis layer in real time through network management protocols.
Data analysis layer: analyzes network data (device, connection, protocol, and security data) and service data in the following aspects:
- Analyzes network data to evaluate network health, and sends health analysis data and network risks to the data presentation layer, implementing proactive network O&M.
- Analyzes service data to identify poor-QoE services, reports information about these services to the data presentation layer, and automatically switches paths for services that require self-healing, implementing proactive fault detection and O&M.
- Performs correlation analysis on network data and service data, implements intelligent fault diagnosis based on AI big data analytics and expert experience, generates fault diagnosis reports, and sends the reports to the data presentation layer.
Data presentation layer: iMaster NCE-IP presents the received data analysis results in various forms, such as dashboards, charts, reports, and relationship diagrams. Additionally, northbound application programming interfaces (APIs) are provided for third-party systems to obtain data analysis results.

Intelligent O&M architecture of carrier networks

Application Scenarios of Intelligent Network O&M

Data Center Networks

Service change:
- Simulation and verification are implemented to evaluate whether the services to be delivered meet user expectations.
- Network changes are visualized in real time, and snapshot data and entry changes before and after device configuration changes are identified to assist in network status analysis.
- Full-lifecycle VM tracing is achieved to help network administrators quickly learn the distribution of online devices and properly plan resources.
- Configuration rollback is supported to rapidly restore services when faults occur, minimizing service interruption loss.
- Automatic server capacity expansion ensures fast service provisioning.
Routine preventive maintenance inspection (PMI):
- The network health is evaluated from multiple dimensions, including device, network, protocol, overlay, and service.
- Network-wide data, including configurations, entries, logs, and key performance indicators (KPIs), is collected using telemetry to detect issues and risks at each layer of the network in real time.
- The device operating status, network capacity, component health, and service interaction are inspected.
- Network performance exceptions are intelligently detected so that potential risks can be identified before services are affected.
All these help O&M personnel comprehensively learn the network status and overall user experience.
Emergency fault rectification:
- Information about different types of faults on the network is collected and analyzed to derive fault correlations from massive amounts of data, so that faults can be analyzed and located quickly and accurately.
- The one-click fault remediation capability ensures continuous and stable service running.
Fault root cause locating:
- The knowledge graph inference engine is used to analyze the collected network data and quickly locate the root causes of faults.
- Unknown faults are learned and inferred, helping O&M personnel deeply explore the root causes of these faults.
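
The exception detection mentioned under routine inspection can be approximated with a simple dynamic baseline. The Python sketch below flags a KPI sample when it deviates from the rolling mean of recent normal samples by more than k standard deviations. This statistical rule is a stand-in chosen for illustration, since the source does not specify the detection algorithms; the window size, k, the minimum sigma, and the sample values are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(samples, window=5, k=3.0, min_sigma=1.0):
    """Return (index, value) pairs that deviate from the rolling
    baseline by more than k standard deviations."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(samples):
        if len(history) == window:
            baseline = mean(history)
            # A floor on sigma avoids flagging tiny changes on flat data.
            sigma = max(pstdev(history), min_sigma)
            if abs(value - baseline) > k * sigma:
                anomalies.append((index, value))
                continue  # keep outliers out of the baseline
        history.append(value)
    return anomalies


if __name__ == "__main__":
    # Hypothetical CPU usage samples (%), with one spike.
    cpu_usage = [40, 42, 41, 39, 40, 41, 95, 40, 42]
    print(detect_anomalies(cpu_usage))
```

Unlike a fixed alarm threshold, the baseline follows the recent behavior of each indicator, so the same rule can be applied to KPIs with very different normal ranges.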

Carrier Networks

Intelligent network O&M has been applied to the intelligent cloud-network solution for carriers.

For example, a carrier has applied intelligent O&M to its network. The application scenarios include:

Multi-dimensional VPN service quality presentation and proactive poor-QoE warning
- The intelligent network O&M system provides abnormal VPN KPI analysis, abnormal VPN traffic analysis, and access point KPI analysis.
- VPN customers can view SLA indicators such as the packet loss rate and delay of VPN services in real time and set poor-QoE thresholds. A warning is proactively generated if a threshold is exceeded.
Automatic and accurate fault demarcation and accurate trouble ticket dispatching
- VPN service paths are displayed precisely and faults are demarcated hop by hop, helping O&M personnel quickly rectify faults.
- 7-day historical playback of VPN services is supported, facilitating post-event fault analysis.
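
The proactive poor-QoE warning described above amounts to comparing measured SLA indicators against customer-defined thresholds. A minimal Python sketch follows; the indicator names, units, threshold values, and service name are hypothetical and are not taken from any product interface.

```python
from dataclasses import dataclass


@dataclass
class SlaThresholds:
    max_loss_pct: float   # assumed unit: percent of packets lost
    max_delay_ms: float   # assumed unit: milliseconds


def check_sla(service, loss_pct, delay_ms, thresholds):
    """Return a list of warning strings for every exceeded threshold."""
    warnings = []
    if loss_pct > thresholds.max_loss_pct:
        warnings.append(
            f"{service}: packet loss {loss_pct}% exceeds {thresholds.max_loss_pct}%"
        )
    if delay_ms > thresholds.max_delay_ms:
        warnings.append(
            f"{service}: delay {delay_ms} ms exceeds {thresholds.max_delay_ms} ms"
        )
    return warnings


if __name__ == "__main__":
    limits = SlaThresholds(max_loss_pct=1.0, max_delay_ms=20.0)
    for warning in check_sla("vpn-a", loss_pct=2.0, delay_ms=25.0, thresholds=limits):
        print(warning)
```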

About This Topic

Author: Zhang Yanlin, Jing Haili
Updated on: 2022-03-18