Data scientists work tirelessly to track cybercriminals, from specific individuals to entire organizations, studying their behavior and the methods they use to compromise data. Because DNS is a ubiquitous protocol used in most internet interactions, it also provides fertile ground for cybercriminals to launch malware. Nominum Data Science examines massive volumes of DNS data (100 billion queries daily) to detect anomalies and uncover the patterns of malicious code authors before other security experts.
By examining domains and queries against domains, Nominum Data Science uncovers the footprints left by cybercriminals. Though not easy, thorough examination of “long-tail” domains provides unique insight into how malware is distributed on the internet. Long-tail, from a statistical standpoint, refers to the portion of a distribution—in this case domain queries—that has a large number of occurrences far from the central (or “head”) part of the distribution.
The chart below shows the number of queries for each unique core domain in a single day of data, ranked from most queried on the left, to least queried on the right. There are a small number of domains on the left that predominate with very high query counts, followed by a long-tail of domains, all of which receive very few queries.
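A ranking like the one charted here can be sketched in a few lines. The example below is a minimal illustration, not Nominum's pipeline: it assumes the query log has already been reduced to one core domain per query, and the sample data and the `rank_domains` helper are made up for demonstration.

```python
from collections import Counter

def rank_domains(queried_domains):
    """Return (domain, query_count) pairs, most queried first."""
    counts = Counter(queried_domains)
    # most_common() sorts by count in descending order.
    return counts.most_common()

# Tiny made-up sample standing in for one day of queries.
sample = ["google.com"] * 5 + ["example.org"] * 2 + ["rare-name.net"]
ranked = rank_domains(sample)

for rank, (domain, count) in enumerate(ranked, start=1):
    print(rank, domain, count)
```

Plotting rank on the horizontal axis against the count on the vertical axis gives the head-and-tail shape described above.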
The Full Scope of the Long-Tail
The long-tail starts at the bend of the curve in the chart, which additional statistical analysis places at approximately the domain ranked 4,000 (the 4,000th most queried unique core domain on this day). This means the top 4,000 domains queried comprise the “head” of the chart: a relatively small number of names for which there were many queries. The bottom 29 million domains make up the long-tail. This is where our data scientists see a lot of cybercriminal activity occurring.
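The article does not say which statistical analysis was used to place the boundary near rank 4,000. One common heuristic, shown here purely as an assumption, is to scale both axes to the range 0 to 1 and pick the rank that lies farthest below the straight line joining the first and last points of the curve. The `knee_rank` function and the toy counts are hypothetical.

```python
def knee_rank(counts):
    """Estimate where the head ends, given query counts sorted descending.

    Returns the 1-based rank of the point farthest below the chord
    joining the first and last points of the normalized curve.
    """
    n = len(counts)
    lo, hi = min(counts), max(counts)
    if n < 3 or hi == lo:
        raise ValueError("need at least three ranks and non-constant counts")
    best_rank, best_gap = 1, float("-inf")
    for i, c in enumerate(counts):
        x = i / (n - 1)            # rank scaled to [0, 1]
        y = (c - lo) / (hi - lo)   # count scaled to [0, 1]
        gap = 1.0 - x - y          # how far the point sits below the chord
        if gap > best_gap:
            best_rank, best_gap = i + 1, gap
    return best_rank

# Made-up counts: a steep head followed by a flat tail.
toy_counts = [1000, 500, 100, 10, 9, 8, 7, 6, 5, 4]
print(knee_rank(toy_counts))
```

On real data the result depends heavily on how the axes are scaled, so a cut-off found this way is a starting point for analysis rather than a fixed boundary.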
If the top 4,000 domains were charted over 1/8 of an inch (which is more than 100 times the resolution of a high-resolution display), it would take eight more feet of the chart to show the remaining long-tail domains. It is critical that organizations carefully evaluate the data there.
Let’s take a quick look at the part of the dataset covered by the top 4,000 core domains:
- Only 0.14% of the total number of unique core domains are in the top 4,000
- They were queried a total of 77 billion times, which is 93% of all queries in the dataset
- The number one domain, “google.com”, was queried 6 billion times
- On average, the top 4,000 domains were queried 19 million times each day
In the top 4,000, accurately identifying threats is critical, since inadvertent placement of a popular domain on a block list (a false positive) is extremely costly. The top 4,000 domains are queried anywhere from slightly less than 900,000 to six billion times per day.
Long-Tail Queries Under the Radar
The vast majority (99.86%) of unique core domains, some 29 million, make up the long-tail. These domains were queried a total of five billion times (6.55% of all queries) on this day.
The DNS querying characteristics of the long-tail domains are quite different from domains in the top 4,000.
- The top-ranked domain in the long-tail (rank 4,001), “cliqz.com,” was queried 817,448 times
- On average, long-tail domains were queried 18 times each, roughly one millionth as often as the top 4,000 domains
- 11 million long-tail domains (more than one third) were queried only once
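The head and long-tail figures quoted in these lists are simple aggregates over the ranked counts. The sketch below shows how such a summary could be computed; the `summarize` function, its field names, and the toy counts are illustrative assumptions and do not reproduce the dataset described here.

```python
def summarize(counts, head_size):
    """Split descending query counts into head and tail and describe each."""
    total_domains = len(counts)
    total_queries = sum(counts)

    def stats(part):
        queries = sum(part)
        return {
            "domains": len(part),
            "domain_share": len(part) / total_domains,
            "queries": queries,
            "query_share": queries / total_queries,
            "mean_queries": queries / len(part) if part else 0.0,
            "queried_once": sum(1 for c in part if c == 1),
        }

    return {"head": stats(counts[:head_size]), "tail": stats(counts[head_size:])}

# Made-up counts: two heavy hitters and nine rarely queried names.
toy = [900, 50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
report = summarize(toy, head_size=2)
print(report["head"])
print(report["tail"])
```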
In the long-tail, a different set of data science problems applies. In contrast with DDoS or high-ranked malware, which involve few domains and lots of queries, some malware may use thousands of different core domains with only a few queries for each one. Or it might use only a very small number of names, with a relatively modest number of queries. Faint signals, or evidence of malicious activity, must be found in the midst of considerable noise.
The Morphing Nature of the Long-Tail
Complicating matters even more, malicious domains change day-by-day and even hour-by-hour. Failure to identify a malicious domain means an exploit goes unchecked. Algorithms must be extremely accurate to meet these challenges.
The table below compares the head and the long-tail, including the difference in the number of malware families seen in each:
| Metric | The Head (top domains) | The Long-Tail | Observation |
|---|---|---|---|
| Total domains | 4,000 | 29 million | Many domains are generated with malicious intent by cybercriminals. |
| % of total unique core domains | 0.14% | 99.86% | Though vast, the long-tail must be examined to uncover patterns. |
| Average queries per domain | 19 million | 18 | Domain Generation Algorithms (DGAs) create huge numbers of domains; many are never used or are waiting to be used for malicious activity. |
| Malware families observed | 58 | 198 | The long-tail is a hotbed for cybercriminal activity. |
| % of total queries | 93% | 6.55% | Domains that are infrequently queried require additional analysis, such as checking domain name length and looking for nonsensical names. |
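Name length and "nonsensical" names can be turned into simple numeric features. The sketch below is one rough way to do that, using the character entropy of the leftmost label as a stand-in for how random a name looks. This is an illustrative assumption, not the analysis Nominum describes, and the `name_features` helper and example domains are made up.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Bits per character of the character distribution in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def name_features(domain):
    """Length and character entropy of the leftmost label of a core domain."""
    label = domain.lower().rstrip(".").split(".")[0]
    return {"length": len(label), "entropy": shannon_entropy(label)}

print(name_features("google.com"))
print(name_features("xq7zk2vbp9.net"))  # made-up, random-looking name
```

Entropy alone misclassifies plenty of legitimate names (short brands, hashes used by content delivery networks), so features like these are usually combined with query behavior rather than used on their own.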
Malware Distribution in the Long-tail
- 198 malware families were observed (compared with 58 in the top 4,000)
- 137 malware families (69%) used fewer than five domain names, and 69 malware families (35%) used just one domain name
- The 137 malware families that used fewer than five domain names sent five million queries
- The 69 malware families that used just one domain name sent 1,708,459 queries
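Counts like these come from grouping labeled malicious domains by family. The sketch below assumes a hypothetical input of `(family, domain, query_count)` records; the family names, domains, and helper functions are invented for illustration.

```python
from collections import defaultdict

def family_footprint(records):
    """Count distinct domains and total queries for each malware family."""
    domains = defaultdict(set)
    queries = defaultdict(int)
    for family, domain, count in records:
        domains[family].add(domain)
        queries[family] += count
    return {f: {"domains": len(domains[f]), "queries": queries[f]} for f in domains}

def families_under(footprint, max_domains):
    """Families that used fewer than max_domains distinct names."""
    return sorted(f for f, s in footprint.items() if s["domains"] < max_domains)

# Made-up labeled records: (family, core domain, queries seen).
records = [
    ("family_a", "a1.example", 10),
    ("family_a", "a1.example", 5),
    ("family_b", "b1.example", 3),
    ("family_b", "b2.example", 4),
] + [("family_c", f"c{i}.example", 1) for i in range(6)]

footprint = family_footprint(records)
print(families_under(footprint, max_domains=5))
```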
This data reveals a lot about how malware creators try to avoid detection by flying under the radar: the long-tail holds far more data to sift through, and more potential attacks are dispersed throughout it. The goal of Nominum Data Science, a team of security experts who use algorithms to process nearly 4 TB of DNS data per day, is to make the radar extremely sensitive, so even the slightest signals are detected and blocked.
Once malicious domain names, or “target names,” are recognized, they are fed into the Global Intelligence Xchange (GIX), a dynamic threat list created by Nominum Data Science. Nominum’s flagship security software, N2 ThreatAvert, incorporates GIX, which dynamically updates servers with the latest target names and carefully configured filters. This blocks malicious queries, while protecting good ones, before malicious activity becomes a problem.