What Is AIOps?
Artificial intelligence for IT operations (AIOps) is the practice of applying artificial intelligence (AI) techniques to manage, control, and analyze the massive volumes of operations and maintenance (O&M) data generated by IT systems, thereby automating and optimizing O&M processes to improve efficiency and quality.
AIOps uses AI technologies such as machine learning and deep learning to analyze and process O&M data. This enables AIOps to dynamically assess the health of O&M objects, proactively identify faults, and detect potential issues, thus improving system availability and stability.
As AIOps becomes more integral in IT O&M, it will empower enterprises to accelerate their digital transformation, achieving greater efficiency and intelligence.
Why Is AIOps Required?
IT system O&M has evolved from manual O&M to automatic O&M, and is now moving towards AIOps. In manual O&M, experts rely on their experience to diagnose, locate, and resolve issues, which demands extensive expertise and results in low overall efficiency. In automatic O&M, script- and tool-based automation significantly improves efficiency, but teams must master complex tool development and iteration.
However, the accelerating pace of digitalization is driving unprecedented changes in lifestyles, social structures, and business models. As a result, IT systems are becoming increasingly complex, with new technologies, architectures, and data types emerging rapidly. O&M engineers must cope with overwhelming data volumes, dynamic system status, diverse service applications, and varying configuration parameters. Traditional manual and automatic O&M approaches fall short in managing IT systems at this scale.
To address this issue, Gartner proposed the concept of AIOps in 2016. The AIOps platform leverages technologies such as big data, machine learning (ML), and AI to automatically analyze and learn from massive O&M data, including historical, log, service, and system data. This enables AIOps to provide actionable insights and decision-making suggestions, enhancing or partially replacing existing O&M processes and operations. Therefore, AIOps can meet the O&M requirements of large-scale IT systems.
As enterprises and organizations continue to undergo digital and intelligent transformation, the need for AIOps persists and intensifies.
Benefits and Advantages of AIOps
AIOps encompasses the entire O&M lifecycle, spanning data collection, analysis, decision-making, execution, and predictive exception handling, which empowers O&M personnel to rapidly detect and precisely address IT system anomalies. AIOps offers several key benefits and advantages, including:
- Shortened MTTR
Mean time to repair (MTTR) is a critical metric for measuring system reliability and maintainability. It is normally used to assess fault rectification efficiency and maintenance team performance.
AIOps enables maintenance teams to perform correlation analysis, big data processing, and inference on effective O&M data from multiple IT systems. Compared to manual or automatic O&M approaches, AIOps enables faster and more accurate fault detection, locating, and troubleshooting, thereby significantly reducing MTTR.
- Upgrade from reactive O&M to proactive O&M
In traditional O&M practices, when an exception arises in the IT system, the O&M team relies on experience or tool-based analysis to identify the fault, verify its occurrence, and then troubleshoot it. By that point, the fault may already have affected upper-layer application performance. Moreover, O&M teams may fail to detect exceptions in the first place, becoming aware of an issue only after the service team reports a fault, and only then starting fault locating.
AIOps uses capabilities such as big data analytics and machine learning to implement predictive O&M. By analyzing historical data and current trends, potential exceptions can be identified before they cause service disruptions. As training data accumulates and models are iterated, predictions become increasingly accurate. By tracking indicator trends, exceptions can be anticipated even while services are still running normally. In this way, the O&M team can perform maintenance proactively to eliminate faults at an early stage and keep services running stably over the long term.
- Lower O&M costs
The initial investment in an AIOps system is substantial, but the long-term benefits can outweigh it. By automating routine maintenance tasks, organizations can free up personnel to focus on more strategic and innovative work. Over time, this lowers operating costs and improves cost-effectiveness for the organization as a whole.
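As a rough illustration of the MTTR metric described above, the following Python sketch averages repair durations over a set of incidents. The incident records are hypothetical, and the choice to measure from detection time to resolution time is an assumption: organizations define the start and end of a repair differently.

```python
from datetime import datetime


def mean_time_to_repair(incidents):
    """Return MTTR in minutes for a list of (detected_at, resolved_at) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total_seconds = sum(
        (resolved - detected).total_seconds() for detected, resolved in incidents
    )
    return total_seconds / len(incidents) / 60


# Hypothetical incident records: (time detected, time service restored).
incidents = [
    (datetime(2024, 8, 1, 9, 0), datetime(2024, 8, 1, 9, 45)),     # 45 minutes
    (datetime(2024, 8, 2, 14, 10), datetime(2024, 8, 2, 14, 25)),  # 15 minutes
    (datetime(2024, 8, 3, 22, 0), datetime(2024, 8, 3, 23, 0)),    # 60 minutes
]

mttr_minutes = mean_time_to_repair(incidents)
print(f"MTTR: {mttr_minutes:.1f} minutes")  # prints "MTTR: 40.0 minutes"
```

Tracking this value before and after introducing AIOps tooling is one simple way to check whether fault handling is actually getting faster.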
Technical Features of AIOps
According to Gartner, AIOps products or platforms mainly include the following five types of technical elements:
- Data source: bottom-layer record data from each IT infrastructure.
- Big data platform: processes and analyzes static and dynamic real-time data.
- Computing and analysis: data preprocessing, data standardization, and other data cleaning work.
- Algorithm: used for computing and analysis to generate results required in IT O&M scenarios.
- Machine learning: includes unsupervised, supervised, and semi-supervised learning.
On the whole, the key competitiveness of AIOps is reflected in three aspects: AI-based core algorithm capability, seamless integration with IT systems, and diverse data integration capability.
Algorithms are the core capabilities of AIOps. Currently, AIOps algorithms focus on exception detection, prediction, and root cause analysis. The main technical trends are as follows:
- Exception detection technology: Traditional approaches typically rely on supervised algorithms, whereas AIOps utilizes a combination of supervised and unsupervised algorithms.
- Prediction technology: is evolving from traditional machine learning to deep learning, represented by long short-term memory (LSTM) networks.
- Root cause analysis technology: Traditionally, correlation rules and unsupervised algorithms are used. A newer trend is to use knowledge graph-based algorithms for root cause analysis.
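To make the idea of unsupervised exception detection more concrete, here is a minimal Python sketch that flags a metric sample when it deviates strongly from the recent past. It uses a trailing-window z-score, which is a deliberately simple stand-in chosen for illustration and not the algorithm of any particular AIOps product. The function name, window size, threshold, and sample data are all assumptions.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization samples (percent) with one spike at index 7.
cpu_percent = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
anomalous_indices = detect_anomalies(cpu_percent)
print(anomalous_indices)  # prints [7]
```

No labeled fault data is needed, which is what makes the approach unsupervised. One known weakness of this sketch is that a spike inflates the window statistics for the next few samples, so a more careful detector would exclude flagged points from the history.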
What Are the Differences Between DevOps and AIOps?
DevOps is a combination of software development (Dev) and IT operations (Ops). It is a set of processes, methods, and systems that emphasize collaboration and communication between software development (applications or software engineering), technical operations, and quality assurance (QA) departments. By automating the process of "software delivery" and "architecture change", software can be built, tested, and released more quickly, frequently, and reliably.
Both DevOps and AIOps are methodologies used to optimize software development and O&M. However, they differ in some ways:
- DevOps aims to optimize collaboration and automation between development and O&M teams, in order to accelerate software delivery and improve its quality, while fostering teamwork and continuous integration.
- AIOps leverages advanced technologies such as AI and ML to optimize O&M processes, enabling intelligent O&M management through data analysis and inference. AIOps emphasizes real-time fault detection, automatic and intelligent troubleshooting, as well as resource optimization.
In a nutshell, DevOps focuses on optimizing the software delivery process. One of its main limitations is that it relies on human-defined instructions and processes, which restricts its ability to adapt to new problems independently. In contrast, AIOps provides intelligent O&M capabilities, enabling more autonomous problem-solving and decision-making through technologies such as AI and ML.
Huawei's AIOps Platform and Solution in the Data Communication Field
In the data communication field, Huawei provides higher-level autonomous driving network solutions through iMaster NCE series products, spanning the entire lifecycle of various networks, from planning and construction to maintenance and optimization. There is no doubt that AIOps is a key enabler of these solutions.
Take Huawei's data center network as an example. The key capabilities implemented in the AIOps phase include but are not limited to the following:
- Change and capacity expansion: identification of service change intents, automatic recommendation of change solutions, simulation and evaluation before change delivery, on-demand rollback after change delivery, and automatic generation of acceptance reports.
- Monitoring: automatic creation of monitoring tasks based on service views for continuous monitoring.
- Troubleshooting: real-time exception detection, problem identification within 1 minute, automatic root cause analysis, recommendation of the optimal rectification solution, and prediction of software and hardware faults.
- Parameter adjustment and optimization: automatic adjustment of devices' internal queues by using the traffic model to achieve zero packet loss; establishment of dynamic service quality baselines to predict service deterioration in advance.
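The idea of predicting service deterioration in advance can be sketched with a simple trend extrapolation. The Python example below fits a straight line to recent metric samples by ordinary least squares and estimates how many sampling intervals remain before the trend reaches a limit. This is an illustrative assumption, not a description of how iMaster NCE builds its baselines; the function names, latency values, and limit are hypothetical.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept for x = 0 .. n-1."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def intervals_until_breach(ys, limit):
    """Estimate the number of future sampling intervals before the fitted
    trend reaches `limit`.

    Returns 0.0 if the trend is already at or above the limit, and None
    if the trend is flat or improving.
    """
    slope, intercept = fit_line(ys)
    latest_fit = slope * (len(ys) - 1) + intercept
    if latest_fit >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - latest_fit) / slope


# Hypothetical latency samples (ms), one per sampling interval.
latency_ms = [10, 12, 14, 16, 18]
remaining = intervals_until_breach(latency_ms, limit=30)
print(remaining)  # prints 6.0
```

A linear fit only captures steady drift. Seasonal or bursty traffic would need a richer model, which is where the deep learning approaches mentioned earlier come in.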
As shown in the following figure, iMaster NCE is an automated and intelligent platform that integrates management, control, and analysis, serving as the brain of data center autonomous driving networks. The intent engine, automation engine, analytics engine, intelligence engine, and network digital twin base help implement high automation and intelligent O&M throughout the lifecycle of data center networks.
Logical architecture of iMaster NCE (data center autonomous driving network management and control system)
The core component of AIOps is iMaster NCE's analytics engine, which establishes a unified framework for fault detection, root cause analysis and intelligent inference, as well as troubleshooting and maintenance. By leveraging big data technology, the engine can collect and analyze massive device data, detect device KPIs, status, and entry changes in real time, and support full-flow collection and analysis. This engine consists of health check, anomaly detection, and root cause analysis.
- Health check
Abstracts and models network KPIs, traffic, and status indicators, establishes a network health check system oriented to devices, networks, protocols, and services, and comprehensively evaluates network health status in real time from multiple dimensions, such as performance, capacity, status, security attacks, and connectivity.
- Anomaly detection
Based on the network health evaluation, proactively predicts faults that have not yet occurred and quickly detects network exceptions and faults that have already occurred.
- Root cause analysis
By leveraging knowledge graphs for in-depth feature extraction and learning, and combining this with troubleshooting and configuration entry comparison methods, the root causes of network faults can be quickly located. After the root cause is identified, the impact of the fault is analyzed and a preferred troubleshooting solution is recommended. Additionally, potential fault risks can be identified and analyzed in advance using data such as network traffic, and proactive optimization can be performed to eliminate potential network risks.
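The intuition behind graph-based root cause analysis can be shown with a much smaller structure than a full knowledge graph. The Python sketch below assumes a hand-written dependency map and a set of components currently raising alarms, and reports the alarming components whose own direct dependencies are all healthy. The topology, component names, and function are hypothetical, and this is not the analytics engine's actual method.

```python
def root_cause_candidates(depends_on, alarming):
    """Return alarming components none of whose direct dependencies
    are alarming, sorted by name."""
    alarming = set(alarming)
    return sorted(
        node
        for node in alarming
        if not any(dep in alarming for dep in depends_on.get(node, ()))
    )


# Hypothetical topology: each component maps to what it directly depends on.
depends_on = {
    "web-app": ["leaf-switch-1"],
    "database": ["leaf-switch-2"],
    "leaf-switch-1": ["spine-switch-1"],
    "leaf-switch-2": ["spine-switch-1"],
    "spine-switch-1": [],
}

# Components currently raising alarms.
alarming = {"web-app", "leaf-switch-1", "spine-switch-1"}

candidates = root_cause_candidates(depends_on, alarming)
print(candidates)  # prints ['spine-switch-1']
```

The alarms on web-app and leaf-switch-1 are treated as symptoms because something they depend on is also alarming. A real system would also weigh alarm timing, learned fault patterns, and configuration differences, as the paragraph above describes.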
- Author: Zhang Fan
- Updated on: 2024-08-05