Kubernetes | Advanced | Par time: 8:00

The DNS Cache Lie

Payment service DNS lookups are failing intermittently. The service exists and endpoints are healthy.

The Scenario

Ninety seconds ago, a platform engineer ran a migration script that deleted and recreated the payment-service Kubernetes Service object. During the 2-second gap between delete and recreate, one CoreDNS replica received several DNS queries and cached NXDOMAIN responses for payment-service.production.svc.cluster.local. Checkout is now failing on 47% of requests: exactly the requests that land on pods whose local DNS resolver is using the CoreDNS replica that cached the stale negative response. The other CoreDNS replicas have the correct A record.
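The split-brain in this scenario can be illustrated with a toy model. The sketch below is not CoreDNS code: the `Replica` class, the replica names, and the 300-second negative TTL are all hypothetical, chosen only so that the cached NXDOMAIN outlives the 90 seconds in the story. The real lifetime depends on how the cache is configured in your Corefile.

```python
from dataclasses import dataclass, field

NAME = "payment-service.production.svc.cluster.local"
NEGATIVE_TTL = 300  # seconds; hypothetical value, not a CoreDNS default


@dataclass
class Replica:
    """Toy model of one DNS replica with its own negative (NXDOMAIN) cache."""
    name: str
    negative_cache: dict = field(default_factory=dict)  # qname -> expiry time

    def resolve(self, qname, now, service_exists):
        expiry = self.negative_cache.get(qname)
        if expiry is not None and now < expiry:
            # Cached denial: answered without checking current cluster state.
            return "NXDOMAIN"
        self.negative_cache.pop(qname, None)  # drop expired entry, if any
        if service_exists:
            return "A"
        # Service is missing right now: remember the denial until it expires.
        self.negative_cache[qname] = now + NEGATIVE_TTL
        return "NXDOMAIN"


replica_a = Replica("coredns-a")  # hypothetical replica names
replica_b = Replica("coredns-b")

# t=1s: the Service is deleted and only replica_a happens to be queried.
during_gap = replica_a.resolve(NAME, now=1, service_exists=False)
# t=90s: the Service was recreated long ago, but replica_a still says no.
stale = replica_a.resolve(NAME, now=90, service_exists=True)
fresh = replica_b.resolve(NAME, now=90, service_exists=True)
# Once the negative entry expires, replica_a recovers without intervention.
recovered = replica_a.resolve(NAME, now=1 + NEGATIVE_TTL, service_exists=True)

print(during_gap, stale, fresh, recovered)  # NXDOMAIN NXDOMAIN A A
```

Each replica keeps its own cache, so the answer a pod receives depends only on which replica handled its query. That is why a fixed share of requests fails while the Service and its endpoints look healthy.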

What You'll Learn

1. How CoreDNS negative response caching (NXDOMAIN TTL) creates split-brain DNS
2. Why deleting and recreating a Service causes a temporary DNS poisoning window
3. Using kubectl exec to test DNS from individual pods and correlate to CoreDNS replicas
4. Flushing CoreDNS cache or using the reload plugin to recover from stale NXDOMAIN
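As a rough sketch of the last two points, the helpers below build the `kubectl exec ... nslookup` command that targets one CoreDNS replica directly and pick out the replicas that still answer NXDOMAIN. The helper names, pod name, and IP addresses are hypothetical. The `k8s-app=kube-dns` label and the `coredns` Deployment in `kube-system` are assumptions that match common kubeadm-style clusters and may differ in yours. Restarting the Deployment is shown as one way to discard the in-memory cache, not the only one.

```python
QNAME = "payment-service.production.svc.cluster.local"

# Assumption: CoreDNS pods carry the k8s-app=kube-dns label in kube-system.
LIST_COREDNS_IPS = [
    "kubectl", "get", "pods", "-n", "kube-system", "-l", "k8s-app=kube-dns",
    "-o", "jsonpath={.items[*].status.podIP}",
]

# Assumption: CoreDNS runs as a Deployment named "coredns" in kube-system.
# Restarting its pods discards their in-memory caches.
RESTART_COREDNS = [
    "kubectl", "rollout", "restart", "deployment/coredns", "-n", "kube-system",
]


def nslookup_cmd(client_pod, namespace, dns_server_ip, qname=QNAME):
    """Build a command that queries one specific DNS server from inside a pod."""
    return ["kubectl", "exec", client_pod, "-n", namespace, "--",
            "nslookup", qname, dns_server_ip]


def stale_replicas(outputs_by_ip):
    """Return the replica IPs whose nslookup output reports NXDOMAIN."""
    return sorted(ip for ip, out in outputs_by_ip.items() if "NXDOMAIN" in out)


# Hypothetical captured outputs, keyed by CoreDNS pod IP.
sample = {
    "10.244.0.5": f"** server can't find {QNAME}: NXDOMAIN",
    "10.244.1.7": f"Name: {QNAME}\nAddress: 10.96.12.34",
}
print(nslookup_cmd("checkout-abc12", "production", "10.244.0.5"))
print(stale_replicas(sample))
```

Each command list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine with cluster access. Querying each replica IP directly, rather than the cluster DNS Service address, is what makes it possible to tell the replicas apart.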

Tools You'll Use

kubectl, CoreDNS logs, nslookup in-pod, Service spec

Real-World Context

Service delete-recreate patterns - common in migration scripts and blue-green deployments - create brief DNS poisoning windows. The resulting split-brain is hard to diagnose because all services appear healthy from the outside.
