How DNS Optimisation Improved Peak-Traffic Stability
“It's always DNS: How a DNS overhaul stopped our provider request timeouts”
By Milan, Team Lead, DevOps
In iGaming, peak traffic is not an edge case. It is the moment operators build for. A major provider release, a successful campaign, a tournament surge, or a high-stakes sporting event can all create sharp spikes in player activity. When that happens, platform reliability depends on far more than adding compute. One of the less visible but more critical contributors to performance is DNS — the system services use to find and talk to each other across the platform.
The symptom we kept seeing was specific. Casino and betting providers operate with a 3-second timeout window on requests into our platform. Under load, we were getting recurring reports of provider requests timing out — not because the application could not handle the work, but because internal DNS resolution was eating into the budget before the request ever reached business logic. The same pattern showed up in our own service logs as intermittent database connection issues and flaky inter-service calls. All of it pointed to the same place.
Why DNS becomes a bottleneck under burst load
In a Kubernetes cluster, every service-to-service call typically begins with a DNS query. At low concurrency this is cheap. Under burst load, three things change at once. First, the volume of queries hitting the cluster DNS pods rises sharply. Second, short-name lookups like rabbitmq.rabbitmq.svc are expanded against the cluster search path (the pod's own namespace first, for example default.svc.cluster.local, then svc.cluster.local, cluster.local, and so on), so a single logical lookup can generate four or five actual DNS queries before it resolves. Third, because cluster DNS traffic is UDP by default, retries and conntrack pressure on the node compound the problem.
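To make the search-path effect concrete, this is roughly what the resolver configuration injected into a pod looks like; the nameserver address and namespace below are illustrative, not our actual values:

```
# /etc/resolv.conf inside a pod in a namespace called "payments" (illustrative)
nameserver 10.96.0.10        # cluster DNS service IP
search payments.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```

With ndots:5, any name containing fewer than five dots is tried against each search suffix before being queried as-is, and each attempt is typically doubled because A and AAAA records are requested separately. That is how one short name turns into a small burst of queries per call.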
None of these are visible at normal load. All of them become visible at peak — and once DNS resolution starts eating hundreds of milliseconds, a 3-second provider timeout is suddenly not as generous as it sounds.
What we changed
We addressed the bottleneck on three fronts.
CoreDNS tuning. We revisited our CoreDNS deployment end to end: caching strategy and cache sizing, resource requests and limits, replica counts and autoscaling behaviour, and the deployment configuration itself. Default settings are reasonable starting points, but they are not tuned for the specific concurrency patterns of a high-volume iGaming workload.
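As a rough illustration of where the caching side of that tuning lives, the cache plugin in the CoreDNS Corefile controls cache capacity and TTLs; the numbers below are placeholders, not our production values:

```
# Corefile excerpt (illustrative sizing, not our production values)
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # Larger caches and explicit TTLs for positive and negative answers
    cache 60 {
        success 16384
        denial  8192
    }
    prometheus :9153
    forward . /etc/resolv.conf
    loop
    reload
    loadbalance
}
```

Replica counts, resource requests and limits, and autoscaling behaviour sit in the CoreDNS Deployment and its autoscaler rather than the Corefile, and were adjusted alongside it.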
NodeLocal DNS Cache. We introduced NodeLocal DNS Cache as a per-node DNS caching layer and pointed pod DNS resolution at it. Most DNS queries are now answered locally on the node where the workload runs, rather than traversing the network to a central CoreDNS pod. It also lets us use TCP upstream for cache misses, which removes a class of UDP conntrack failures that show up under sustained burst load.
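The relevant part of the node-local configuration is the forward block used for cache misses. The addresses below are the conventional defaults from the upstream manifests, and __PILLAR__CLUSTER__DNS__ is a template placeholder the manifest substitutes with the upstream CoreDNS address; both are shown purely for illustration:

```
# node-local-dns Corefile excerpt (addresses are upstream defaults, illustrative)
cluster.local:53 {
    errors
    cache {
        success 9984 30
        denial  9984 5
    }
    # Listens on a link-local address and the kube-dns service IP,
    # so pod DNS traffic is answered on the node itself
    bind 169.254.20.10 10.96.0.10
    # Cache misses go upstream to cluster CoreDNS over TCP,
    # avoiding the UDP conntrack races seen under burst load
    forward . __PILLAR__CLUSTER__DNS__ {
        force_tcp
    }
    prometheus :9253
}
```

Because the cache binds both the link-local address and the existing cluster DNS IP, pods keep resolving against the address they already know while the answers are served locally.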
Fully qualified internal domains. We moved cluster-wide to fully qualified internal names — for example rabbitmq.rabbitmq.svc.cluster.local instead of rabbitmq.rabbitmq.svc. By default, the cluster resolver appends the search path to short names, generating multiple lookups per call. Using fully qualified names collapses that into a single query. On its own, this change produced a roughly 5x improvement in observed DNS behaviour under load.
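In practice this mostly meant sweeping service configuration for short names. A minimal sketch of the change, using a hypothetical environment variable name:

```
# Deployment excerpt (RABBITMQ_HOST is a hypothetical variable name)
env:
  - name: RABBITMQ_HOST
    # The trailing dot marks the name as absolute, so the resolver
    # skips search-path expansion and issues a single lookup
    value: "rabbitmq.rabbitmq.svc.cluster.local."
```

Where a trailing dot is awkward to carry in configuration, lowering ndots through the pod's dnsConfig is an alternative lever for cutting the same fan-out.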
In parallel, we reviewed the relevant Oracle Cloud Infrastructure networking configuration to make sure nothing in the underlying platform was acting as a hidden limiter once DNS throughput improved.
Results in production
We validated the changes in staging across several weeks of burst-traffic load testing, then rolled them out to production with monitoring in place.
The results were clear. Provider timeout reports — the recurring issue that triggered the investigation — have stopped. We are no longer receiving complaints about requests exceeding the 3-second window due to internal resolution latency. Just as telling, a number of our own internal services that had been reporting intermittent database connection issues and similar transient errors have gone quiet. These were never logged as DNS problems. They presented as flaky dependencies. Removing the DNS bottleneck removed them too.
This is the part of infrastructure work that is hardest to see from the outside. A single underlying issue can surface as half a dozen unrelated-looking symptoms across the platform. Fixing the root cause makes a long tail of small reliability issues disappear at the same time.
Building for the moments that matter
The lesson behind this work is the one the meme captures: it's always DNS. Reliability is not only about compute. It is about the unglamorous dependencies that only surface under real peak load, and that often present as something else entirely. Default cloud and Kubernetes settings carry a platform a long way, but at scale, the smallest infrastructure dependency can become a critical-path issue when concurrency spikes.
At Vyking, our DevOps focus is not only to keep systems running. It is to make sure they are ready for the moments when performance matters most.