What Is DGA?

A domain generation algorithm (DGA) is an algorithm that generates domain names from inputs such as random character sequences, time-based values, dictionary words, or hard-coded values. DGA-generated domain names look random and are typically used by hosts in centralized botnets to reach their command and control (C&C) servers while evading domain name blacklists.

Harm of DGA

With the continuous development of Internet technologies, a plethora of malware has emerged. Malware is now one of the foremost threats to cyber security and a preferred means for malicious actors to seek illegal gains through the Internet. Illegal pornography, gambling, and fraud websites are also rampant. The domain names behind these illegitimate but lucrative operations are often generated randomly, which makes them difficult to monitor with blacklists on communication devices in enterprise and community networks. DGA is the reason for this. Even if law enforcement agencies, technology companies, or hosting service providers block illegitimate domain names, malware can still use a DGA to generate new ones, allowing its operators to continue sending and receiving commands or sharing stolen data.

  • Malicious domain names are generated in large numbers and cannot be effectively blocked

    Malicious actors can use a DGA to generate thousands of malicious domain names every day. Network security devices cannot block all of them by configuring blacklists.

  • High randomness, difficult to detect

    DGA is used by various malware families to generate a large number of pseudo-random domain names. Pseudo-random means that the string sequence appears random but, because it is derived from a predetermined algorithm and input, it can be reproduced exactly. Most generated domain names are never registered. Only a small number are registered, and infected hosts use them to communicate with servers to obtain data or perform other malicious tasks. In addition, whenever a domain name is successfully blocked, the attacker simply registers another one from the list generated by the DGA. As a result, it is extremely difficult for network security devices to identify all of the malicious domain names.

  • Continuous resolution, camouflage, and lurking

    Most DGA domain names cannot be accessed on the Internet because malicious actors cannot register so many domain names. Instead, malware authors use the same seed and algorithm as the malware to generate an identical domain name list and register a few of the names for their C&C servers. The malware keeps resolving the names on the list until it finds an available C&C server. This tactic makes the malware more difficult to block.
    Hazardous process of DGA
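
The shared-seed behavior described above can be sketched as follows. This is a minimal illustration, not the algorithm of any real malware family: the function names `generate_domains` and `first_reachable`, the seed string, the SHA-256 mapping, and the reserved `.example` TLD are all assumptions made for the example, and a set lookup stands in for real DNS resolution.

```python
import hashlib
from datetime import date
from typing import List, Optional, Set


def generate_domains(seed: str, day: date, count: int = 5,
                     tld: str = ".example") -> List[str]:
    """Derive `count` pseudo-random names from a shared seed and a date."""
    domains = []
    for index in range(count):
        # Same seed, date, and counter always produce the same digest.
        material = f"{seed}|{day.isoformat()}|{index}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits onto the letters a-p to form a label.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains


def first_reachable(candidates: List[str], registered: Set[str]) -> Optional[str]:
    """Try each candidate in order until one is live (set lookup replaces DNS)."""
    for name in candidates:
        if name in registered:
            return name
    return None


if __name__ == "__main__":
    today = date(2024, 1, 10)
    operator_list = generate_domains("demo-seed", today)  # computed by the operator
    bot_list = generate_domains("demo-seed", today)       # computed on the infected host
    registered = {operator_list[3]}                       # only one name is registered
    print(bot_list == operator_list)
    print(first_reachable(bot_list, registered))
```

Because both sides run the same function on the same seed and date, the operator needs to register only one name from the list, whereas a blacklist would have to cover every candidate for every day.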

DGA Classification

  • By Seed

    A seed is one of the input parameters an attacker supplies to a DGA to generate domain names. Different seeds produce different DGA domain names. DGAs use many types of seeds, including dates, popular search terms on social networks, random numbers, and dictionary words. The DGA generates a character string from the seed as the prefix and appends a top-level domain (TLD) to obtain the final algorithmically generated domain (AGD).

    Generally speaking, seeds are characterized by whether they depend on time and whether they are deterministic:

    1. Time-based: The DGA uses time-based data as the input, such as the system time of the controlled host or the time in an HTTP response.
    2. Deterministic: The input for mainstream DGAs is fixed, so the AGDs can be calculated in advance. However, certain DGAs have inputs that cannot be predicted. For instance, the notorious malware Bedep uses the foreign exchange reference rates published daily by the European Central Bank (ECB) as one of the seeds for its DGA; Torpig uses keywords from major social networking sites as seeds and only becomes active when a domain name is registered within a specific time window.
    Combining these two properties gives the following types of DGA domain names:
    • TDD: time-dependent and deterministic
    • TDN: time-dependent and non-deterministic
    • TIN: time-independent and non-deterministic
    • TID: time-independent and deterministic
  • By Generation Scheme
    There are different DGA generation schemes:
    • Arithmetic-based: This scheme computes a sequence of values that map to ASCII characters, which together form the DGA domain name. This is the most common scheme.
    • Hash-based: The DGA domain name is the hexadecimal representation of a hash value. Common hash algorithms include MD5 and SHA-256.
    • Wordlist-based: In this scheme, words are selected from a dedicated dictionary and combined to reduce the randomness of domain name characters. The dictionary is embedded in malicious programs or extracted from public services.
    • Permutation-based: The characters used for an initial domain name are rearranged into different permutations of the original.
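
The four generation schemes above can be compared side by side in a short sketch. Everything here is an assumption for illustration: the function names, the linear congruential generator constants, the six-word list `WORDS`, the base string, and the reserved `.example` TLD do not come from any specific malware family.

```python
import hashlib
import random

WORDS = ["river", "stone", "cloud", "maple", "amber", "field"]  # hypothetical word list


def arithmetic_label(seed: int, length: int = 10) -> str:
    """Arithmetic-based: computed values are mapped to ASCII letters."""
    state = seed
    chars = []
    for _ in range(length):
        # Illustrative linear congruential step; upper bits pick the letter.
        state = (state * 1103515245 + 12345) % 2**31
        chars.append(chr(ord("a") + (state >> 16) % 26))
    return "".join(chars)


def hash_label(seed: int, length: int = 16) -> str:
    """Hash-based: the label is a prefix of the hexadecimal digest."""
    return hashlib.sha256(str(seed).encode()).hexdigest()[:length]


def wordlist_label(seed: int, words: int = 2) -> str:
    """Wordlist-based: dictionary words are concatenated to look less random."""
    rng = random.Random(seed)
    return "".join(rng.choice(WORDS) for _ in range(words))


def permutation_label(seed: int, base: str = "examplebase") -> str:
    """Permutation-based: characters of an initial name are rearranged."""
    rng = random.Random(seed)
    chars = list(base)
    rng.shuffle(chars)
    return "".join(chars)


def to_domain(label: str, tld: str = ".example") -> str:
    """Append a TLD to the generated prefix to form the final AGD."""
    return label + tld


if __name__ == "__main__":
    for scheme in (arithmetic_label, hash_label, wordlist_label, permutation_label):
        print(scheme.__name__, to_domain(scheme(20240110)))
```

The wordlist output is the hardest of the four to tell apart from ordinary names by character statistics alone, which is the reason the scheme exists.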

DGA Detection Methods

  • Supervised learning

    Common supervised learning algorithms include decision tree and random forest. A classifier is trained on features extracted from labeled DGA and normal domain names and is then used to identify DGA domain names.

  • Unsupervised learning

    Models based on decision trees and random forests rely on supervised learning and require certain features to work. One important advantage of unsupervised learning over supervised learning is that labeled datasets are not required. K-means is a simple, well-known unsupervised learning algorithm that is commonly used in DGA domain name detection.

  • Registration status

    The registration status includes whether a domain name is registered, when it was registered, and when the registration expires. The nature of a domain name can be assessed from its registration status on the business platform, in particular its payment status. High-risk domain names can be identified by building a profile from parameters such as the registration time and the amount paid.

  • Threat intelligence

    A threat intelligence platform and DGA dataset are used for detecting known DGA domain names.

  • Entropy-based

    In computing, entropy is a measure of the uncertainty associated with a random variable, that is, of how random a piece of information is. Domain names generated by random algorithms typically have higher entropy than normal domain names. As such, DGA domain names can be distinguished by their entropy.

  • Hidden Markov model

    The hidden Markov model classifies domain names by analyzing the transition probabilities between characters in a string. Highly random DGA domain names do not match the statistical features of normal domain names. Therefore, this method can be used to detect DGA domain names.

  • Deep learning model

    A deep learning model uses a neural network trained on both known DGA domain names and normal domain names, resulting in a classifier that can accurately identify DGA domain names. The deep learning model may be less transparent and more challenging to troubleshoot, but it has proven to be more effective than traditional models. As a result, many products have adopted this approach.
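
The entropy-based and character-transition methods described above can be illustrated with a small sketch. Note the simplifications: the second function pair uses a plain first-order character Markov chain with add-one smoothing rather than a full hidden Markov model, the tiny training list is made up, and all function names are hypothetical. Real detectors train on large sets of known normal and DGA domain names and calibrate their thresholds on that data.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"


def shannon_entropy(label: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    total = len(label)
    counts = Counter(label)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def train_bigram_model(labels):
    """Count character-to-character transitions, with start (^) and end ($) markers."""
    counts = defaultdict(Counter)
    for label in labels:
        padded = "^" + label + "$"
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def avg_log_likelihood(model, label: str) -> float:
    """Average log-probability per transition; lower means less like the training data."""
    vocab_size = len(ALPHABET) + 1  # every character plus the end marker
    padded = "^" + label + "$"
    total = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        row = model.get(prev, Counter())
        # Add-one smoothing so unseen transitions get a small non-zero probability.
        prob = (row[nxt] + 1) / (sum(row.values()) + vocab_size)
        total += math.log(prob)
    return total / (len(padded) - 1)


if __name__ == "__main__":
    # Made-up stand-in for a corpus of normal domain labels.
    normal = ["google", "wikipedia", "example", "weather",
              "network", "support", "download", "mailserver"]
    model = train_bigram_model(normal)
    for label in ("server", "xqzjvkwp"):
        print(label,
              round(shannon_entropy(label), 3),
              round(avg_log_likelihood(model, label), 3))
```

In this sketch, a random-looking label scores higher on entropy and lower on average log-likelihood than a word-like label. Neither score separates wordlist-based DGA names well, which is one reason the two are usually combined with other features or replaced by learned models.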

About This Topic
  • Author: He Yan
  • Updated on: 2024-01-10