The 4 AM Failover That Booted Everything in the Wrong Order
A manufacturing client called us at 4 AM because their primary site had lost power and the on-call tech had panicked into a full Veeam failover. Twelve VMs came up, but in alphabetical order. The application servers started before the domain controllers, the SQL cluster came up before DNS resolved, and nothing authenticated. The replicas were fine. The orchestration was not. This is exactly why Veeam failover plans exist, and exactly why you should never trust one you have not rehearsed end to end.
We spent the next ninety minutes manually rebooting services while the plant floor sat idle. The post-mortem is below, along with the runbook we now use across every managed environment that runs Veeam Backup & Replication 12.
The Incident Timeline
At 03:52 the UPS battery gave up. At 04:11 the on-call engineer right-clicked each replica individually in the console and selected Failover now. By 04:25 all twelve replicas were running on the DR host. By 04:26 none of them were useful. The domain controller replica had come up last. Everything that tried to authenticate before the domain controller finished booting cached a failure and refused to retry without a service restart.
RPO was fine — replication had run fourteen minutes earlier. RTO was the problem. We had promised the business a thirty-minute RTO. We delivered something closer to two hours because the boot order was wrong and nobody had scripted the dependencies.
Root Cause: Individual Failover Versus Orchestrated Failover
The technician used the per-VM failover workflow documented in the Veeam operational guide: expand Replicas, select Ready, right-click the VM, choose Failover now, pick a restore point, confirm. That workflow is fine for a single VM. It is catastrophic for a dependent stack. It gives you no ordering, no delays between groups, no pre-failover scripts, and no way to test the whole thing without triggering real downtime.
A failover plan is the object that fixes this. It groups replicas, assigns them to boot tiers, inserts delays between tiers, and runs the whole sequence as one orchestrated action. You build it once, you test it quarterly, you trust it at 4 AM. Here is the opinionated take — if you run Veeam and you have not built a failover plan, you do not have disaster recovery. You have replicas. Those are not the same thing.
The Fix: A Three-Tier Failover Plan
We rebuilt the client’s DR posture around a tiered boot order. The plan lives in Home > Replicas > Failover Plans and orchestrates twelve VMs across three groups with delays tuned to the slowest service in each tier.
Tier 1 — Infrastructure (0-second delay)
Domain controllers, DNS, and the jump host boot first. We wait 120 seconds after the DC reports a healthy Netlogon service before releasing tier two. Microsoft’s AD DS documentation is explicit that a domain controller replica must reach a consistent state before dependent services trust it, and USN rollback is a real risk if you rush this.
Tier 2 — Data (120-second delay)
SQL Server, the file server, and the Veeam repository come up next. We wait another 180 seconds. SQL recovery on a large database can take longer than people expect, and application tiers that connect before recovery finishes will cache failures.
Tier 3 — Applications (300-second delay from plan start)
The ERP front end, the web tier, and two internal line-of-business apps. By the time they try to resolve DNS and connect to SQL, both are ready.
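The same tiered structure can be scripted rather than clicked together in the console. This is a minimal sketch using the Veeam Backup & Replication PowerShell module; the VM names (DC01, SQL01, ERP01) and the plan name are illustrative placeholders, and you should confirm the BootDelay behavior against your VBR version before trusting the numbers:

```powershell
# Build one failover plan object per representative VM.
# BootDelay is a pause in seconds applied around the VM's boot --
# verify in your VBR version whether it delays this VM or the next one.
$tier1 = New-VBRFailoverPlanObject -VM (Find-VBRViEntity -Name "DC01")  -BootDelay 120
$tier2 = New-VBRFailoverPlanObject -VM (Find-VBRViEntity -Name "SQL01") -BootDelay 180
$tier3 = New-VBRFailoverPlanObject -VM (Find-VBRViEntity -Name "ERP01") -BootDelay 0

# One plan object, one orchestrated action at 4 AM
Add-VBRFailoverPlan -Name "Plant-DR" `
    -FailoverPlanObject $tier1, $tier2, $tier3 `
    -Description "Tiered boot: infrastructure, data, applications"
```

The payoff is that ordering and delays live in one reviewable object instead of in an on-call engineer's memory.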
For teams managing this at scale across clients, we push the plan definitions through Ansible and Veeam PowerShell so the DR configuration is versioned alongside the rest of the infrastructure. That also means every managed environment under our backup solution gets an identical, auditable DR spec.
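As a sketch of what that versioning can look like, the plan inventory is exported to JSON on a schedule and committed next to the playbooks. Property names beyond Name and Description vary between VBR versions, so inspect the objects with Get-Member before extending this:

```powershell
# Snapshot the failover plan inventory into a diffable artifact for git
Get-VBRFailoverPlan |
    Select-Object Name, Description |
    ConvertTo-Json -Depth 3 |
    Set-Content -Path "dr\failover-plans.json"
```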
Planned Failover Versus Failover Now
The console exposes two verbs that sound similar and behave very differently. Failover now is for real disasters — the primary is gone and you are activating replicas from the last good restore point. Planned failover is for maintenance windows — Veeam replicates once more, powers off the primary, replicates the final delta, then brings up the replica with zero data loss. Use planned failover for data center migrations, host maintenance, or when you see an impending disaster you still have time to react to. Use failover now only when the primary is unreachable.
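In PowerShell terms the two verbs are distinct cmdlets. This is a sketch against the Veeam PowerShell module; the plan and VM names are illustrative, and Start-VBRPlannedFailover should be verified against your module version:

```powershell
# Real disaster: the primary is gone, run the orchestrated plan
# from the last good restore points
Start-VBRFailoverPlan -FailoverPlan (Get-VBRFailoverPlan -Name "Plant-DR")

# Maintenance window: final delta replication, power off the primary,
# then a clean switchover with zero data loss
Find-VBRViEntity -Name "SQL01" | Start-VBRPlannedFailover
```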
For environments using native Hyper-V Replica, the PowerShell equivalent is Start-VMFailover. A test failover uses the -AsTest switch:
# Test failover against a specific snapshot without affecting production
# Verify the replica boots, services start, and applications respond
Get-VMSnapshot -VMName VM01 -Name Snapshot01 | Start-VMFailover -AsTest
# Confirm the test VM is running in an isolated network before declaring success
Get-VM -Name VM01 | Select-Object Name, State, Status
# Tear the test down once verification is complete
Stop-VMFailover -VMName VM01
Test Your Failover Plan or Regret It
The 3-2-1-1-0 rule ends with a zero — zero errors after recovery verification. A failover plan you have never executed is an untested backup with extra steps. Veeam supports test failover against an isolated virtual lab, which spins the replicas up on a fenced network so they cannot talk to production. Run it quarterly. Run it after any topology change. Run it before you renew the vSphere license, because that is a calendar event you will not forget.
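Our quarterly sign-off runs scripted probes from a jump host inside the fenced lab rather than trusting a green console icon. The hostnames below are illustrative; the point is to exercise the dependency chain in boot order, not just ping the VMs:

```powershell
# Tier 1: the replica DC answers DNS and its AD services are running
Resolve-DnsName -Name "erp.corp.local" -Server "dc01"
Get-Service -ComputerName "dc01" -Name NTDS, Netlogon, DNS

# Tier 2: the SQL listener accepts connections
Test-NetConnection -ComputerName "sql01" -Port 1433

# Tier 3: the application answers over HTTP
Invoke-WebRequest -Uri "http://erp.corp.local/health" -UseBasicParsing
```

If any probe fails, the test failover failed, regardless of how many VMs show as powered on.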
A Forrester analysis we referenced during the client’s DR review made the same point in blunter language — organizations that test recovery quarterly experience roughly half the unplanned downtime of those that test annually. Our own incident data from managed environments matches that number closely.
Caveats and Limitations
Failover plans are not magic. A few things they will not do for you:
- They will not fix application-layer dependencies that live outside the VM — external APIs, SaaS integrations, or third-party auth providers.
- They will not detect data corruption in the replica. If the primary was corrupt at the last replication cycle, the replica is corrupt too. You still need backups separate from replication.
- They will not handle IP re-addressing cleanly unless you configure network mappings and re-IP rules ahead of time.
- They cannot test what your users actually do. A passing test failover and a working business are not the same thing.
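The re-IP caveat deserves one concrete example. When the DR site sits on a different subnet, mapping rules are created ahead of time and attached in the replica job's network settings. This is a sketch against the Veeam PowerShell module with made-up subnets; verify the parameter set for your VBR version:

```powershell
# Map the production subnet to the DR subnet; the wildcard preserves
# the host octet, so 10.0.1.42 becomes 10.9.1.42 on failover
$reIpRule = New-VBRViReplicaReIpRule -SourceIp "10.0.1.*" -TargetIp "10.9.1.*" `
    -TargetGateway "10.9.1.1" -DNS "10.9.1.10"
```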
If your DR strategy sits inside a broader technology roadmap, the failover plan is one layer. You also need documented runbooks, tested restores, and people who have actually rehearsed the decision tree. Related reading from our team — Tracing Windows Boot and Service Init with Sysinternals is useful when you are diagnosing why a tier-three VM refuses to come up cleanly after a failover.
Lessons Learned From the 4 AM Incident
- Build the failover plan before you need it. Building one under pressure guarantees mistakes.
- Tier your VMs by dependency, not by alphabet. Domain controllers first, data next, apps last.
- Insert delays long enough for the slowest service in each tier to stabilize.
- Test quarterly in an isolated virtual lab. An untested plan is a theory.
- Document the failover now versus planned failover decision so the on-call engineer does not guess at 4 AM.
- Keep immutable backups separate from replication. Replication carries corruption forward.
If you want a second set of eyes on your Veeam failover plans, or you are not sure your last test actually proved anything, reach out to our team. We will run the plan in a lab, measure the real RTO, and tell you honestly whether the recovery works. Because the question is never whether the backup completed. The question is always — but have you tested the restore?