We had a client come to us six months post-deployment with a vSphere cluster running at 78% average CPU utilization across all hosts. DRS was enabled. HA was enabled. Everything was green in the dashboards. And yet their VMs were crawling. Turns out nobody had touched the DRS aggressiveness slider since initial setup. It was sitting at the default – level 3 – which is about as useful as a speed limit sign with no enforcement.
Here is my position: most VMware vSphere environments are not poorly designed. They are poorly maintained. The platform can do a lot of heavy lifting for you, but only if you configure it past the defaults. And that is the problem – defaults feel safe, so they never get touched.
The Default Configuration Is Actively Costing You
vSphere ships with settings tuned for broad compatibility, not for performance or operational efficiency. Reasonable choice by VMware. Unreasonable choice to leave those settings untouched in production.
DRS aggressiveness defaults to level 3 out of 5. In practice, this means the scheduler makes conservative rebalancing decisions. If you are running a mixed-workload cluster – databases, web servers, dev environments all sharing the same hosts – level 3 will leave resource imbalances sitting there longer than necessary. I typically push clusters with stable, predictable workloads to level 4. For dynamic environments with bursty VMs, level 5. Migrations increase. That is fine. That is the whole point.
HA admission control is another default that quietly causes problems. The default policy reserves capacity for one host failure. Reasonable. But a static percentage reservation – 25% of CPU and memory – can cut significantly into your effective capacity. With 4 identical hosts, 25% is exactly one host's worth of capacity, so the number is right. With 8 hosts, one host is only 12.5%, and a 25% reservation strands an extra host's worth of resources. Calculate the reservation from your actual host count and cluster size, not the out-of-the-box numbers.
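The arithmetic is simple enough to sketch. This is a minimal illustration, assuming identical hosts and a percentage-based admission control policy (the function name and the example host counts are mine, not vSphere's):

```python
def ha_reservation_percent(host_count: int, failures_to_tolerate: int = 1) -> float:
    """Percent of cluster CPU/memory to reserve so the cluster can absorb
    `failures_to_tolerate` host failures, assuming identical hosts."""
    if failures_to_tolerate >= host_count:
        raise ValueError("cannot tolerate losing every host in the cluster")
    return 100 * failures_to_tolerate / host_count

# 4 identical hosts: a 25% reservation matches one host exactly
print(ha_reservation_percent(4))   # 25.0
# 8 hosts: one host is only 12.5%, so a static 25% over-reserves
print(ha_reservation_percent(8))   # 12.5
```

Re-run the calculation whenever you add or remove hosts; a reservation sized for last year's cluster is the quiet capacity leak described above.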
DRS Anti-Affinity Rules Deserve More Respect
I see this missed constantly. Teams deploy multi-node application clusters – SQL Always On, RabbitMQ, whatever – and then wonder why two nodes end up on the same ESXi host after a rebalance. They never configured anti-affinity rules.
Anti-affinity rules force DRS to separate specified VMs across physical hosts. For any VM pair where co-location equals a single point of failure, this is non-negotiable. The blast radius of losing a single ESXi host should never include two nodes of the same cluster.
This is one of those “best practices” I will stand behind without qualification: anti-affinity rules for HA workloads, always. No exceptions.
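Auditing for this is mechanical once you frame it as a placement check. Here is a minimal sketch, independent of any vSphere API; the VM and host names are hypothetical, and in practice you would feed it real placements pulled from vCenter:

```python
from collections import defaultdict
from itertools import combinations

def find_coloc_violations(placements: dict[str, str],
                          anti_affinity_groups: list[set[str]]) -> list[tuple[str, str, str]]:
    """Return (vm_a, vm_b, host) triples where two members of the same
    anti-affinity group landed on the same host."""
    violations = []
    for group in anti_affinity_groups:
        by_host = defaultdict(list)
        for vm in sorted(group):          # sorted for deterministic output
            if vm in placements:
                by_host[placements[vm]].append(vm)
        for host, vms in by_host.items():
            violations.extend((a, b, host) for a, b in combinations(vms, 2))
    return violations

placements = {"sql-ag-01": "esx-01", "sql-ag-02": "esx-01",
              "rmq-01": "esx-02", "rmq-02": "esx-03"}
groups = [{"sql-ag-01", "sql-ag-02"}, {"rmq-01", "rmq-02"}]
print(find_coloc_violations(placements, groups))
# [('sql-ag-01', 'sql-ag-02', 'esx-01')]  <- both SQL nodes on one host
```

Any non-empty result is exactly the single point of failure the rule exists to prevent.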
Storage Policy Based Management Is Not Just Marketing
Why SPBM Changes How You Provision
Before Storage Policy Based Management (SPBM) was widely adopted, storage provisioning was a manual, tribal-knowledge exercise. “Put the database VMs on that datastore, dev VMs over there.” No documentation. No enforcement. New admin joins the team? Good luck figuring out which LUN is which tier.
SPBM lets you define storage policies – tags tied to capabilities like performance tier, redundancy level, or replication status – and assign those policies to VMs at provisioning time. vSphere ensures the VM lands on a datastore that meets the policy. vSAN takes this further by enforcing policies dynamically as the storage pool changes state.
Include backup tier requirements in your storage policies. VMs requiring daily off-site replication should not be landing on non-replicated datastores. If you do not have a solid offsite backup strategy tied to your storage policies, you are relying on institutional memory – which is not a backup strategy. For teams thinking through VM-level backup tooling, the breakdown of Veeam Cloud Connect is worth reading alongside your SPBM design.
The operational payoff is significant. Provisioning becomes self-documenting. Compliance checks are automated. You stop relying on the one person who remembers what is what.
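The core mechanic is just capability matching: a policy is a set of required capabilities, and a datastore is compliant only if it advertises all of them. A toy sketch, with capability names and datastore names that are illustrative rather than real SPBM schema:

```python
def compliant_datastores(policy: dict, datastores: dict[str, dict]) -> list[str]:
    """Names of datastores whose advertised capabilities satisfy every
    requirement in the policy (capability keys here are made up)."""
    return [name for name, caps in datastores.items()
            if all(caps.get(key) == want for key, want in policy.items())]

gold_policy = {"tier": "ssd", "replicated": True}   # e.g. "daily off-site replication"
datastores = {
    "ds-ssd-repl": {"tier": "ssd",  "replicated": True},
    "ds-ssd":      {"tier": "ssd",  "replicated": False},
    "ds-sata":     {"tier": "sata", "replicated": True},
}
print(compliant_datastores(gold_policy, datastores))  # ['ds-ssd-repl']
```

Once backup requirements are part of the policy, a VM that needs replication simply cannot land on `ds-ssd` – the enforcement replaces the institutional memory.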
Pairing SPBM with Storage DRS
Storage DRS handles I/O load balancing across datastore clusters. With SPBM defining the tiers and Storage DRS balancing within them, you get a setup that reacts to I/O pressure without manual intervention. I have seen environments go from weekly “why is the database slow” tickets to zero simply by enabling Storage I/O Control and setting appropriate thresholds. (35ms latency threshold is a reasonable starting point for most environments running mixed workloads.)
This works well for homogeneous datastore clusters. It breaks down fast if you mix different-speed datastores in the same cluster – keep your tiers separate or you will confuse the scheduler into making bad placement decisions.
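To make the threshold concrete: the check Storage I/O Control is effectively doing is "has observed latency crossed the congestion line?" A simplified sketch with invented datastore names and sample values, using the 35ms starting point from above:

```python
from statistics import mean

CONGESTION_THRESHOLD_MS = 35.0  # starting point suggested above; tune per environment

def congested(latency_samples_ms: dict[str, list[float]],
              threshold: float = CONGESTION_THRESHOLD_MS) -> list[str]:
    """Datastores whose mean observed device latency exceeds the threshold."""
    return [ds for ds, samples in latency_samples_ms.items()
            if mean(samples) > threshold]

samples = {
    "ds-gold":   [12.0, 18.5, 14.2],   # healthy
    "ds-bronze": [41.0, 38.7, 52.3],   # sustained contention
}
print(congested(samples))  # ['ds-bronze']
```

The real mechanism then throttles per-VM I/O shares on the congested datastore rather than just reporting it, but the trigger condition is this simple comparison.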
Host Profiles: Stop Configuring ESXi Hosts by Hand
If you are configuring more than two ESXi hosts individually through the UI, you are generating toil. Pure, unnecessary toil.
Host profiles let you capture the configuration of a reference host – NTP settings, syslog configuration, networking layout, security settings – and apply that template across every other host in the cluster. More importantly, vCenter monitors for drift and flags any host that deviates from the attached profile. Configuration drift is one of the most common sources of “intermittent” issues that are actually deterministic inconsistencies between hosts.
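Drift detection itself is a diff against the reference. A minimal sketch of the idea, with hypothetical host names and config keys standing in for the real profile settings:

```python
def config_drift(reference: dict, hosts: dict[str, dict]) -> dict[str, dict]:
    """For each host, the settings that deviate from the reference profile.
    Missing keys count as drift (reported as None)."""
    drift = {}
    for host, cfg in hosts.items():
        delta = {key: cfg.get(key)
                 for key, want in reference.items() if cfg.get(key) != want}
        if delta:
            drift[host] = delta
    return drift

reference = {"ntp": "pool.internal", "syslog": "loghost:514", "ssh": "disabled"}
hosts = {
    "esx-01": {"ntp": "pool.internal", "syslog": "loghost:514", "ssh": "disabled"},
    "esx-02": {"ntp": "pool.internal", "syslog": "loghost:514", "ssh": "enabled"},
}
print(config_drift(reference, hosts))  # {'esx-02': {'ssh': 'enabled'}}
```

That one-line difference on `esx-02` is exactly the kind of "intermittent" issue that is actually deterministic: anything scheduled onto that host behaves differently, everything else looks fine.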
Pair host profiles with Auto Deploy for new host provisioning and you have an almost entirely automated host lifecycle. A new blade slots into the rack, PXE boots, gets the correct ESXi image, gets the correct config. No manual touchpoints. If you are also working through a PowerShell automation readiness audit for your Windows management layer, combining that discipline with vSphere host profile automation creates a genuinely low-toil infrastructure stack.
For large environments, this is not optional. It is the difference between a repeatable infrastructure and a snowflake farm.
Update Manager: Patch on a Schedule or Regret It Later
I do not enjoy patching. Nobody enjoys patching. But I have seen enough unpatched ESXi hosts sitting at 6.7 Update 1 well into 2024 to know that “we will get to it” is not a patching strategy.
vSphere Update Manager – now Lifecycle Manager in vSphere 7.x and later – handles host patching with built-in orchestration. It respects DRS for workload evacuation, stages patches before maintenance windows, and handles cluster-level rolling updates. The blast radius of a failed patch is minimized because you are doing one host at a time with HA keeping workloads running elsewhere.
Set up a baseline. Attach it to your cluster. Schedule remediation. The complexity is not in the tooling – it is in never having set it up and then scrambling when a critical CVE drops on a Friday afternoon.
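The rolling-update flow is worth internalizing even if the tooling hides it. A sketch of the per-host sequence (host names hypothetical; the real orchestration also handles DRS evacuation and HA capacity checks for you):

```python
def rolling_remediation_plan(hosts: list[str]) -> list[str]:
    """One-host-at-a-time patch plan: enter maintenance mode (DRS evacuates
    workloads), remediate against the attached baseline, exit, move on."""
    steps = []
    for host in hosts:
        steps += [f"{host}: enter maintenance mode",
                  f"{host}: remediate against baseline",
                  f"{host}: exit maintenance mode"]
    return steps

for step in rolling_remediation_plan(["esx-01", "esx-02"]):
    print(step)
```

Because only one host is out of service at a time, a failed remediation strands one host in maintenance mode, not your workloads – that is the blast-radius containment described above.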
The Counterargument: “vSphere Manages Itself Just Fine”
I hear this from teams that have been running vSphere for years without major incidents. And honestly? They are not wrong that the platform is stable. But stable is not the same as efficient.
The argument for leaving defaults alone usually comes down to two things: “if it ain’t broke” or genuine resource constraints – time, headcount, competing priorities. Both are real. If you are running a 10-VM environment for a small business, the ROI on tuning DRS aggressiveness is probably negative.
But scale changes the math. At 200 VMs across 6 hosts, a misconfigured DRS policy means real performance degradation. Misconfigured HA means unexpected behavior during failure scenarios. Unmanaged host drift means hours of debugging when two hosts behave differently for no immediately obvious reason.
Why Operational Discipline Compounds Over Time
The SRE framing here is error budget. Every misconfigured component in your vSphere cluster is passively burning error budget. It is not catastrophic until it is. And when it is, you are in a situation where the root cause was a default that nobody questioned.
I have watched teams spend three days debugging intermittent VM performance issues that turned out to be storage I/O contention – a problem Storage DRS would have redistributed automatically. Three days. That is error budget and engineering time gone to a problem that should not have existed.
The investment in configuring vSphere correctly is front-loaded. The returns compound. The same pattern shows up at cloud scale too – if you are running hybrid workloads alongside on-premises vSphere, the lessons from real Azure infrastructure outages map directly to the same root causes: defaults left unchecked, capacity planning skipped, monitoring gaps that only surface during incidents.
What You Should Actually Do
Here is where I land after a decade of running vSphere environments:
- Audit your DRS settings today. Check aggressiveness level, verify anti-affinity rules exist for every HA workload, review VM-to-host rules for any licensing or performance requirements.
- Recalculate HA admission control. Match reserved capacity to your actual host count. Over-reserving wastes usable capacity. Under-reserving defeats the purpose of HA entirely.
- Implement SPBM for any environment with multiple storage tiers. Tag your datastores, define your policies, enforce them at provisioning time. Include backup requirements in the policy definition.
- Attach host profiles to your clusters. Capture a known-good reference host, apply the template, let vCenter monitor for drift. This eliminates a category of unexplained host behavior differences.
- Get Lifecycle Manager running patching on a schedule. Not eventually. A defined schedule tied to a maintenance window.
- Review your switch architecture. If you are still running vSphere Standard Switches on any production cluster, evaluate moving to Distributed Switches for centralized management, port mirroring, and NetFlow visibility.
If your vSphere environment has grown to the point where these changes feel overwhelming to tackle internally, that is a signal worth acting on. Bringing in external expertise for a structured audit – DRS tuning, HA validation, storage policy review, update baseline setup – often surfaces issues that internal teams have been too close to see. Not a knock on internal teams. Fresh eyes find different things.
The platform gives you the tools to run a well-managed, low-toil infrastructure. Most teams leave half of them unused. Ship the config changes. Measure the before and after. The feedback loop is faster than you think.
Ready to audit your vSphere environment or plan your next infrastructure upgrade? Contact the SSE team and we can work through your specific setup together.
Related reading: Kubernetes Cluster Management Essentials for IT Ops | How to Manage DNS and DHCP: IT Admin Complete Guide | GitHub Discussions: Community Engagement for IT Teams

