Hands-on production incident simulations for SREs, DevOps engineers, and technical founders.
Drop into a realistic terminal with a ticking clock and a system on fire. Run commands, find the root cause, and fix it before time runs out. No setup required.
Runs in your browser. Takes 5 minutes. The only production incident simulator you can start in seconds.
Deploy agents, coding assistants, and automated pipelines are running in production right now. When they fail, the blast radius is larger and the failure modes are stranger. These scenarios simulate what happens when the tools meant to help become the incident.

Multi-window incident workstation - Slack, Terminal, AI Agent logs, and live metrics in one view
Your AI deploy agent promised a 30-second rollback. That was 12 minutes ago. Pods are in mixed state, ECR auth expired, and the agent is stuck in a retry loop.
Production 500s spiking. Your AI coding assistant investigated, found a bug, and shipped a fix. Errors dropped - then came back worse. The AI fixed the wrong thing.

From CrashLoopBackOff to CoreDNS negative cache poisoning. Each scenario drops you into a realistic EKS-style console with real kubectl commands, live pod events, and a ticking clock. Built for SREs who want hands-on Kubernetes debugging practice across the full range of failure modes - not just the obvious ones.
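If you've never seen one of these consoles, the opening moves look like this - the pod and namespace names below are illustrative, but the commands are real kubectl:

    # List pods and restart counts in the affected namespace
    kubectl get pods -n payments
    # Pull the events and last state for a crashing pod
    kubectl describe pod checkout-7d4b9 -n payments
    # Read the logs of the previous (crashed) container
    kubectl logs checkout-7d4b9 -n payments --previous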

terraform destroy on production. Based on the real DataTalksClub incident that hit the front page of Hacker News. Play through an authentic Claude Code split-panel interface - the same kind of setup that caused the original disaster.
What the DevOps community said when this happened in real life
"Claude Code wiped our production database with a Terraform command. It took down the DataTalksClub course platform and 2.5 years of submissions."
"Just as someone posts that Claude Code deleted a production environment via Terraform, we see 'all those annoying manual approvals need to go away'"
"I call BS on anyone who says they check every little thing their agent does. This will happen more, not less."
Live 3D architecture, real-time logs, real commands, and a ticking clock. Can you fix it before the money runs out?
The Situation Room - interactive 3D war room with live architecture, real commands, and a ticking clock
22+ scenarios based on incidents that actually took down production. New ones added every 2 weeks.

Your AI deploy agent promised a 30-second rollback. That was 12 minutes ago. ECR auth expired, pods in mixed state.

Your AI assistant fixed a bug. Errors dropped. Then came back 10x worse. The AI fixed the wrong thing.

Random 500s and slow page loads. The on-call engineer just quit.

3 AM. Mobile users can't connect. The website shows 'Your connection is not private'.

The entire application stack is crashing with write errors.

The API server is slowly consuming all available memory. Requests are timing out.

Checkout is completely broken. Payments can't process.

After deploying a new API version, the mobile app is crashing on launch.

Users are reporting 504 errors when trying to resize images.

Massive slowdowns after a Redis restart. Everything is hitting the database.
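For that last one, a plausible first check (hostname is illustrative) confirms the stampede - a restart without persistence leaves the cache empty, so every request misses and falls through to the database:

    # An empty keyspace after the restart means every lookup is a miss
    redis-cli -h cache.internal dbsize
    # Hit/miss counters make the stampede visible
    redis-cli -h cache.internal info stats | grep keyspace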

Pods are crashing on every deploy. CrashLoopBackOff, missing secrets, failed rollbacks. Debug it with real kubectl before it happens at 2 AM.

Pods keep restarting but never reach Ready. The liveness probe is the culprit.

Checkout pods OOMKilled but memory looks normal. The sidecar is hiding something.

Deployment stuck. Old pod won't die. Users getting duplicate notifications.

Some pods resolve DNS fine. Others get NXDOMAIN. Same cluster.

Three compounding failures from one maintenance window. Expert level.

All pods Pending. Three nodes, three different blocking reasons.

Database replica down. PV stuck on a terminated spot instance.

Sidecar stopped updating after cluster upgrade. Stale config in prod.

DNS failing intermittently. Cached NXDOMAIN from a brief service outage.
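That last scenario turns on a detail of CoreDNS's cache plugin: it caches negative (NXDOMAIN) answers as well as successful ones, so a brief outage can keep failing lookups long after the service recovers. Where you'd likely look first (kube-system is the standard namespace):

    # Is CoreDNS itself healthy?
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    # Inspect the Corefile; the cache plugin's settings control how long
    # a negative answer keeps being served
    kubectl get configmap coredns -n kube-system -o yaml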

kubectl, logs, db queries - actual commands

Streaming logs, error spikes, gauges

Revenue drops, PagerDuty fires, pressure builds

Root cause, optimal path, what you missed
Every scenario you complete earns XP. Hit milestones to unlock pro scenarios for free.
Or skip the grind - Pro unlocks all scenarios instantly.
Sign up with GitHub or Google - takes 10 seconds

Most engineers and technical founders get paged cold with zero prior experience handling a real incident. Reading runbooks doesn't build on-call instincts. YouBrokeProd drops you into realistic incident simulations so when the real page comes in at 3 AM, you've already been there.
10+ scenarios across beginner, intermediate, and advanced. New ones every 2 weeks.

Read Postgres error states, diagnose connection pool saturation, and fix replication issues without guessing - a sample diagnostic query follows this list.

Diagnose CrashLoopBackOff, OOMKills, and missing secrets the way you would on a real EKS cluster - kubectl and nothing else.

Recognize credential exposure patterns, suspicious traffic, and misconfigurations that lead to real breaches.
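For the Postgres track, a pool-saturation check usually starts at pg_stat_activity - a minimal sketch, with an illustrative hostname:

    # Count backends by state; a pile of 'idle in transaction' points at the pool
    psql -h db.internal -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"
    # Compare against the server's connection ceiling
    psql -h db.internal -c "SHOW max_connections;"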
Each scenario is a real-time simulation running in your browser. No setup. Just you, a terminal, and a production incident to solve.
Pick a scenario and difficulty. You get a briefing with symptoms, a simulated terminal, and a ticking clock.
Run real commands in the terminal - check logs, query metrics, inspect configs. Built-in hints if you get stuck.
Submit your root cause diagnosis, then apply the fix command. Scored on speed, accuracy, and efficiency.
See what you got right, what you missed, and the optimal diagnostic path. Compare your score on the leaderboard.
Run the same incident simulation across your SRE, platform, or founding engineering team. Compare scores, identify skill gaps, reduce MTTR, and build shared muscle memory for when the real pages come in. Manager reports and team leaderboards included.
Sign up free and start your first incident simulation in under a minute.
Start Your First Simulation