Ten years ago everyone evaluating DNS solutions was always concerned about performance. Broadband networks were getting faster, providers were serving more users, and web pages and applications increasingly stressed the DNS. Viruses were a factor too as they could rapidly become the straw that broke the camel’s back of a large ISP’s DNS servers. The last thing a provider needed was a bottleneck, so DNS resolution speed became more and more visible, and performance was everything.
A lot has changed. Now most providers focus on properly securing their DNS (there’s a great post on that topic here) protecting their networks from bots and even ensuring the safety of their subscribers. More and more providers also recognize the opportunity to take more of an architectural approach to deploying DNS and other services to improve network efficiency, maximize agility, and ensure differentiation.
Back to performance. It’s still a factor, but it’s important to understand how to assess the performance of a DNS resolver because superficial performance evaluations will result in unpleasant surprises when servers fail under load. Vendors tend to lead with the best possible performance numbers, which don’t reflect real world operating conditions. For instance the easiest test, which will yield the highest performance numbers, involves sending a single query at a resolver, over and over again. How often does that happen in the real world?
The most important thing to test is recursion, because it allows a resolver to find answers not in its cache. Under real world conditions entries age out and queries come in requesting domains that have not been cached. Typically a resolver will answer about 80% of queries from cache, although the percentage can vary substantially, so it’s necessary to test a range of cache hit rates (the percentage of incoming queries that match an entry in the cache) to understand how a resolver will behave under the wide range of operating conditions it will encounter in production.
The extreme case of recursive testing is a “cold” cache test where none of the incoming queries match entries in the cache (0% cache hit rate). A cold cache test is often dismissed as “not real world” since a cache becomes populated and “warms” up very quickly; but there’s a very important reason why it must be evaluated. It’s simple to create DNS queries that will not be populated in a cache – and attackers can trivially exploit this technique to force a resolver to do more work (handle recursive queries). Under these attacks servers that handle recursion poorly will crash, or suffer heavy packet loss which compromises their query handling performance and thus the end user experience.
It’s worth asking questions when vendors promote “cold cache” performance to make sure there’s agreement about what they mean. A cold cache test meansevery query forces the DNS server to perform external resolution. It most definitely does not mean starting with a cold cache and then querying the server, allowing it to build up its cache over time!
The DNS is still an essential part of the Internet and resolver performance is important, but savvy providers know there’s a lot more. They also know performance testing has to go beyond simple drag races that measure how well a server responds to a single query repeated in rapid succession. Resolvers have to be subjected to rigorous recursive tests to ensure they can withstand the wide range of subscriber traffic they’ll encounter, as well as floods of malicious queries.
Bottom line, DNS resolvers have to offer massively high performance under every possible operating scenario, and be highly reliable and secure to withstand all the badness sent their way (DDoS, cache poisoning). On the most fundamental level they have to withstand every variant of real-world traffic, including malicious traffic, and continue to provide a critical network service no matter what.