Hands-on production incident simulations for SREs, DevOps engineers, and technical founders.
Drop into a realistic terminal with a ticking clock and a system on fire. Run commands, find the root cause, and fix it before time runs out. No setup required.
Runs in your browser. Takes 5 minutes. The only production incident simulator you can start in seconds.
Deploy agents, coding assistants, and automated pipelines are running in production right now. When they fail, the blast radius is larger and the failure modes are stranger. These scenarios simulate what happens when the tools meant to help become the incident.

Multi-window incident workstation - Slack, Terminal, AI Agent logs, and live metrics in one view
Your AI deploy agent promised a 30-second rollback. That was 12 minutes ago. Pods are in mixed state, ECR auth expired, and the agent is stuck in a retry loop.
Production 500s spiking. Your AI coding assistant investigated, found a bug, and shipped a fix. Errors dropped - then came back worse. The AI fixed the wrong thing.

From CrashLoopBackOff to CoreDNS negative cache poisoning. Each scenario drops you into a realistic EKS-style console with real kubectl commands, live pod events, and a ticking clock. Built for SREs who want hands-on Kubernetes debugging practice across the full range of failure modes - not just the obvious ones.
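If you've never seen one of these consoles, the opening moves look like this - the pod and namespace names below are illustrative, but the commands are real kubectl:

    # List pods and restart counts in the affected namespace
    kubectl get pods -n payments
    # Pull the events and last state for a crashing pod
    kubectl describe pod checkout-7d4b9 -n payments
    # Read the logs of the previous (crashed) container
    kubectl logs checkout-7d4b9 -n payments --previous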

terraform destroy on production. Based on the real DataTalksClub incident that hit the front page of Hacker News. Play through an authentic Claude Code split-panel interface - the same kind of setup that caused the original disaster.
What the DevOps community said when this happened in real life
"Claude Code wiped our production database with a Terraform command. It took down the DataTalksClub course platform and 2.5 years of submissions."
"Just as someone posts that Claude Code deleted a production environment via Terraform, we see 'all those annoying manual approvals need to go away'"
"I call BS on anyone who says they check every little thing their agent does. This will happen more, not less."
Live 3D architecture, real-time logs, real commands, and a ticking clock. Can you fix it before the money runs out?
The Situation Room - interactive 3D war room with live architecture, real commands, and a ticking clock
22+ scenarios based on incidents that actually took down production. New ones added every 2 weeks.

Your AI deploy agent promised a 30-second rollback. That was 12 minutes ago. ECR auth expired, pods in mixed state.

Your AI assistant fixed a bug. Errors dropped. Then came back 10x worse. The AI fixed the wrong thing.

Random 500s and slow page loads. The on-call engineer just quit.

3 AM. Mobile users can't connect. The website shows 'Your connection is not private'.

The entire application stack is crashing with write errors.

The API server is slowly consuming all available memory. Requests are timing out.

Checkout is completely broken. Payments can't process.

After deploying a new API version, the mobile app is crashing on launch.

Users are reporting 504 errors when trying to resize images.

Massive slowdowns after a Redis restart. Everything is hitting the database.
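For that last one, a plausible first check (hostname is illustrative) confirms the stampede - a restart without persistence leaves the cache empty, so every request misses and falls through to the database:

    # An empty keyspace after the restart means every lookup is a miss
    redis-cli -h cache.internal dbsize
    # Hit/miss counters make the stampede visible
    redis-cli -h cache.internal info stats | grep keyspace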

Pods are crashing on every deploy. CrashLoopBackOff, missing secrets, failed rollbacks. Debug it with real kubectl before it happens at 2 AM.

Pods keep restarting but never reach Ready. The liveness probe is the culprit.

Checkout pods OOMKilled but memory looks normal. The sidecar is hiding something.

Deployment stuck. Old pod won't die. Users getting duplicate notifications.

Some pods resolve DNS fine. Others get NXDOMAIN. Same cluster.

Three compounding failures from one maintenance window. Expert level.

All pods Pending. Three nodes, three different blocking reasons.

Database replica down. PV stuck on a terminated spot instance.

Sidecar stopped updating after cluster upgrade. Stale config in prod.

DNS failing intermittently. Cached NXDOMAIN from a brief service outage.
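That last scenario turns on a detail of CoreDNS's cache plugin: it caches negative (NXDOMAIN) answers as well as successful ones, so a brief outage can keep failing lookups long after the service recovers. Where you'd likely look first (kube-system is the standard namespace):

    # Is CoreDNS itself healthy?
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    # Inspect the Corefile; the cache plugin's settings control how long
    # a negative answer keeps being served
    kubectl get configmap coredns -n kube-system -o yaml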

kubectl, logs, db queries - actual commands

Streaming logs, error spikes, gauges

Revenue drops, PagerDuty fires, pressure builds

Root cause, optimal path, what you missed
Every scenario you complete earns XP. Hit milestones to unlock pro scenarios for free.
Or skip the grind - Pro unlocks all scenarios instantly.
Sign up with GitHub or Google - takes 10 seconds

Most engineers and technical founders get paged cold with zero prior experience handling a real incident. Reading runbooks doesn't build on-call instincts. YouBrokeProd drops you into realistic incident simulations so when the real page comes in at 3 AM, you've already been there.
10+ scenarios across beginner, intermediate, and advanced. New ones every 2 weeks.

Read Postgres error states, diagnose connection pool saturation, and fix replication issues without guessing - a sample diagnostic query follows this list.

Diagnose CrashLoopBackOff, OOMKills, and missing secrets the way you would on a real EKS cluster - kubectl and nothing else.

Recognize credential exposure patterns, suspicious traffic, and misconfigurations that lead to real breaches.
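For the Postgres track, a pool-saturation check usually starts at pg_stat_activity - a minimal sketch, with an illustrative hostname:

    # Count backends by state; a pile of 'idle in transaction' points at the pool
    psql -h db.internal -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"
    # Compare against the server's connection ceiling
    psql -h db.internal -c "SHOW max_connections;"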
Each scenario is a real-time simulation running in your browser. No setup. Just you, a terminal, and a production incident to solve.
Pick a scenario and difficulty. You get a briefing with symptoms, a simulated terminal, and a ticking clock.
Run real commands in the terminal - check logs, query metrics, inspect configs. Built-in hints if you get stuck.
Submit your root cause diagnosis, then apply the fix command. Scored on speed, accuracy, and efficiency.
See what you got right, what you missed, and the optimal diagnostic path. Compare your score on the leaderboard.
Run the same incident simulation across your SRE, platform, or founding engineering team. Compare scores, identify skill gaps, reduce MTTR, and build shared muscle memory for when the real pages come in. Manager reports and team leaderboards included.
Sign up free and start your first incident simulation in under a minute.
Start Your First Simulation