
Domain Correlation: just let the malware beat itself

Intro
This post introduces our ‘Domain Correlation Engine’ and its strategic use as an anti-malware weapon.

But before we get to that, here’s an important observation: Domain correlation is pretty cool. It’s so cool that we created a 3D visualization to demonstrate it:

Since human vision cannot take in vectors with hundreds of dimensions, we use the t-SNE method to project the vectors in our model into 3D. The visualization captures a single hour (Oct 28th, 17:00 UTC) of new core domain clusters. Each dot is a domain name, dots of the same color belong to the same cluster, and information on highlighted clusters appears on the right.

The Model and the Machine
The domain correlation engine, aka ‘domain2vec’, is a neural network model inspired by the famous word2vec and other representation learning methods. In short, from lists of DNS query sequences it learns [1] a vector of hundreds of dimensions for every domain name (aka a ‘distributed representation’), and [2] the correlation between any two domain names, by maximizing the sequence context probability.
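For readers who want the formula behind "maximizing the sequence context probability": the model itself is proprietary and its exact objective is not given here, so what follows is only the standard word2vec skip-gram objective, rewritten with domains in place of words. The symbols are assumptions borrowed from word2vec: d_1 … d_T is one DNS query sequence, c is the context window size, D is the set of all domain names, and v and u are the input and output vectors of a domain.

```latex
\max \; \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(d_{t+j} \mid d_t\right),
\qquad
p\left(d_O \mid d_I\right) = \frac{\exp\left(u_{d_O}^{\top} v_{d_I}\right)}{\sum_{d \in D} \exp\left(u_{d}^{\top} v_{d_I}\right)}
```

Domains that keep appearing in each other's context end up with similar vectors, which is what makes the learned vectors usable as a correlation measure.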
To make things simpler, here is an illustration of the model:




Using the hundreds of billions of real-time DNS queries we receive from our customers every day (and some powerful computing resources), we implemented this proprietary model and built an intelligent machine that correlates any two domain names on the internet in near real-time. Given DNS queries as input, it learns the correlation without any human intervention.
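To make the "DNS queries as input" step more concrete, here is a minimal sketch of how raw queries could be turned into the (target, context) training pairs that a word2vec-style model consumes. It is an illustration only, not the production pipeline: the record shape (client id, timestamp, domain), the function names and the window size are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record shape: (client_id, timestamp_seconds, queried_domain)
Query = Tuple[str, float, str]


def build_sequences(queries: Iterable[Query]) -> Dict[str, List[str]]:
    """Group queries by client and order each client's queries by time."""
    by_client = defaultdict(list)
    for client, ts, domain in queries:
        by_client[client].append((ts, domain))
    return {c: [d for _, d in sorted(items)] for c, items in by_client.items()}


def context_pairs(sequence: List[str], window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (target, context) pairs for domains at most `window` positions apart."""
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, sequence[j]
```

Each pair says "this domain was queried near that one by the same client", which is the only signal the model needs. No labels are involved.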

Noise Reduction and Long-Tail-ability

Billions of queries generated by millions of devices obviously create a lot of noise, but the Domain2Vec model is robust against noise in the sequence. For example, the noise created by the many legitimate ‘www.google.com’ queries, which also show up in malware C&C query sequences, is reduced through the model’s self-learning logic:

  • Malicious queries are unlikely to appear in benign sequences.
  • ‘www.google.com’ has a low correlation score with malicious queries, compared to its correlation with non-malicious queries.
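The effect described in the two points above can be reproduced on toy data with a much simpler stand-in for the model: pointwise mutual information (PMI) computed from co-occurrence counts. This is not what Domain2Vec does internally. It is a hedged sketch that only shows why a domain that appears everywhere scores low against domains that appear almost exclusively together. All `.example` names and the function name are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Callable, Iterable, List


def pmi_table(sequences: Iterable[List[str]]) -> Callable[[str, str], float]:
    """Return pmi(a, b), estimated from how often a and b share a sequence."""
    sets = [set(s) for s in sequences]
    n = len(sets)
    single: Counter = Counter()
    pair: Counter = Counter()
    for s in sets:
        single.update(s)
        pair.update(frozenset(p) for p in combinations(sorted(s), 2))

    def pmi(a: str, b: str) -> float:
        joint = pair[frozenset((a, b))]
        if joint == 0:
            return float("-inf")  # never seen together
        return math.log((joint / n) / ((single[a] / n) * (single[b] / n)))

    return pmi


# Toy sequences: www.google.com shows up in benign and infected traffic alike.
sequences = [
    ["www.google.com", "news.example"],
    ["www.google.com", "shop.example"],
    ["www.google.com", "news.example", "shop.example"],
    ["www.google.com", "cc1.example", "cc2.example"],
    ["cc1.example", "cc2.example"],
    ["cc1.example", "cc2.example"],
]
pmi = pmi_table(sequences)
```

On this toy data, `pmi("cc1.example", "cc2.example")` is positive, `pmi("www.google.com", "news.example")` is positive but smaller, and `pmi("www.google.com", "cc1.example")` is negative, even though www.google.com does appear in an infected client's sequence.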

The Domain2Vec model is also unique in its ability to correlate even unpopular, long-tail domains. The model learns domain correlation from co-occurrence in a DNS query sequence, which can be considered a good approximation of the “domain-client IP” bipartite graph. While other models may do that too, domain2vec is significantly more sensitive in capturing long-tail domain correlation (thanks to our undisclosed data augmentation technique…), so it can observe stealthier malware activities.
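The “domain-client IP” bipartite graph mentioned above is easy to picture in code. The sketch below builds the graph from (client IP, domain) edges and compares two domains by the Jaccard overlap of their client sets. This is a plain graph baseline chosen for illustration, not the Domain2Vec model or its data augmentation. The function names are hypothetical and the IPs come from the documentation range.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def domain_clients(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each domain to the set of client IPs that queried it."""
    graph = defaultdict(set)
    for client_ip, domain in edges:
        graph[domain].add(client_ip)
    return dict(graph)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of clients that two domains have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


edges = [
    ("192.0.2.1", "rare-a.example"), ("192.0.2.1", "rare-b.example"),
    ("192.0.2.2", "rare-a.example"), ("192.0.2.2", "rare-b.example"),
    ("192.0.2.1", "popular.example"), ("192.0.2.2", "popular.example"),
    ("192.0.2.3", "popular.example"), ("192.0.2.4", "popular.example"),
]
graph = domain_clients(edges)
```

Two long-tail domains queried by the same handful of clients overlap completely here, while their overlap with the popular domain is diluted by everyone else who queries it. With only two clients behind each rare domain the estimate is fragile, which is exactly the long-tail difficulty the paragraph refers to.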

Now let’s put the machine to work.
While building sophisticated models and intelligent machines is fun, the application to real-world use cases is what generates the money (…and data science teams are expected to support the company’s bottom line). Fortunately, Domain2Vec and the Correlation Engine help solve an increasingly important problem: cybercrime. Trojans, ransomware, adware and botnets all use domain names and DNS to communicate with the attacker.

Considering the malicious activities as the signal and the benign activities as the noise, we apply the new core domain selection method as our loose signal selection (this was inspired by the common ‘data-driven’ method in high energy physics, by the way), and then cluster the vectors that the domain2vec model learned for these domains. With a suitable clustering algorithm, we group domains with similar behavior into clusters.
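The clustering algorithm is not named here, so the following is only a minimal sketch of the idea, under stated assumptions: domains whose vectors have a cosine similarity above a threshold are linked, and each connected group becomes a cluster (single-linkage via union-find). The threshold, the function names and the two-dimensional toy vectors are all illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(vectors: Dict[str, Sequence[float]], threshold: float = 0.9) -> List[List[str]]:
    """Link domains with cosine similarity >= threshold; return connected groups."""
    names = sorted(vectors)
    parent = {n: n for n in names}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())
```

The pairwise loop is quadratic in the number of domains, which is fine for a toy example but would need an approximate nearest-neighbor index or a different algorithm at real DNS scale.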

Example: Necurs
Take for example this cluster:

ujvvevkxumflxckowoxe.ac.
kevqwpxsnxltrfjkjlcdu.so.
wmfwjrkrdlefjoibnpqj.bz.
njbrgyeoqjertgnfx.to.
uvtqccxpdgi.ga.
csuqymxhpwlsud.nf.
hdngnqrwttq.sc.
hqwcouvy.xxx.
vajhomipxfykqwg.xxx.
ebqwcqjguydhovifyn.ms.
…. (and hundreds of other names)

Based on our engine results, all the domains in this cluster can be identified as the notorious Necurs botnet C&C names.

What’s amazing here is that the model was able to discover Necurs malware domain groups in a completely unsupervised way (without any predefined Necurs malware knowledge or any human intervention). Moreover, by capturing Necurs’ “DNA sequence”, it now reveals all devices infected with Necurs and the queries they made. In other words, it lets Necurs beat itself.
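Going from a discovered cluster back to the infected devices is a simple lookup. As a hedged sketch (the query log shape, client identifiers and function names are assumptions, not our actual data model):

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def normalize(domain: str) -> str:
    """Lower-case a name and drop the trailing root dot, if present."""
    return domain.rstrip(".").lower()


def infected_clients(
    queries: Iterable[Tuple[str, str]], cluster_domains: Iterable[str]
) -> Dict[str, Set[str]]:
    """Map each client to the cluster domains it queried (clients with no hits are omitted)."""
    cluster = {normalize(d) for d in cluster_domains}
    hits = defaultdict(set)
    for client, domain in queries:
        name = normalize(domain)
        if name in cluster:
            hits[client].add(name)
    return dict(hits)


# Two names taken from the cluster listed above; the query log is made up.
necurs_cluster = ["ujvvevkxumflxckowoxe.ac.", "hqwcouvy.xxx."]
query_log = [
    ("client-a", "www.google.com"),
    ("client-a", "HQWCOUVY.xxx"),
    ("client-b", "www.google.com"),
    ("client-c", "ujvvevkxumflxckowoxe.ac."),
]
suspects = infected_clients(query_log, necurs_cluster)
```

Here `suspects` contains client-a and client-c together with the cluster names they asked for, while client-b, which only queried www.google.com, does not appear.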

(We should note that when tuning the hyperparameters and validating a cluster’s output, we examine whether the model can rediscover known domain generation algorithms (DGAs), such as Necurs, Suppobox and Sphinx. Success there supports its discoveries in the many other, unknown malicious domain clusters.)
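How such a rediscovery check is scored is not spelled out here. One simple possibility, shown purely as an assumption, is to measure what fraction of a known DGA family's domains ends up in the single best-matching cluster:

```python
from typing import Iterable, List


def rediscovery_rate(clusters: Iterable[List[str]], known_family: Iterable[str]) -> float:
    """Fraction of a known family's domains captured by its best-matching cluster."""
    known = set(known_family)
    if not known:
        return 0.0
    best = max((len(known & set(c)) for c in clusters), default=0)
    return best / len(known)
```

A rate close to 1.0 for families the model was never told about is a sign that the hyperparameters produce meaningful clusters. A low rate suggests the family is being split across clusters or lost in the noise.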

Conclusions and beyond
Domain2Vec and the Correlation Engine have been ‘responsible’ for blocking many botnets and malware strains over the past few years. Working with carriers worldwide has helped us improve the model, and has helped the carriers provide a cleaner internet to their subscribers.

And the learning doesn’t stop: Domain2Vec keeps self-learning while being improved by its creators. As of today, it is able to correctly detect 30-40 known malware families, as well as hundreds of unknown threats that are not yet classified. Its accuracy gets better by the day. Since this engine learns the correlation between any two domains, there are many other use cases beyond clustering: it can propagate known threat scores to unknown domains through the similarity network, it can extend the malware knowledge graph to a larger graph, and that’s just the beginning.
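Score propagation through a similarity network can be sketched in a few lines. The version below is a generic label-propagation scheme, not the engine's actual method: domains with known threat scores stay fixed, and every other domain repeatedly takes the similarity-weighted average of its neighbors. The edge weights, score scale (0.0 benign to 1.0 malicious) and names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple


def propagate(
    edges: Dict[Tuple[str, str], float],
    known: Dict[str, float],
    iterations: int = 20,
) -> Dict[str, float]:
    """Spread known threat scores over an undirected, weighted similarity graph."""
    neighbors = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    scores = {node: 0.0 for node in neighbors}
    scores.update(known)  # known scores are the starting point and stay clamped
    for _ in range(iterations):
        updated = dict(scores)
        for node, nbrs in neighbors.items():
            if node in known:
                continue
            total = sum(nbrs.values())
            if total > 0:
                updated[node] = sum(w * scores[m] for m, w in nbrs.items()) / total
        scores = updated
    return scores
```

One caveat of this simple scheme: a domain connected only to malicious neighbors drifts all the way to the malicious score, however weak the link. A practical version would add damping or a prior toward a neutral score.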

To hear more about its innovations, improvements and use cases, meet us next month at the Botconf Convention.
