The change request was direct: patch all web servers before Friday’s maintenance window. Forty-three RHEL 8 hosts, one engineer, no automation. I tried it manually the first time – SSH loop, yum update, reboot, repeat. Three hours in, I miscounted reboots, patched a database server in the wrong host group, and triggered a page at 2 AM. I rebuilt the entire process with Ansible 2.12 that weekend. The Friday maintenance windows are boring now. Boring is exactly what you want.
What Shell Scripts Miss
The reflex when automating server tasks is to write a Bash script. Loop over a host list, SSH in, run commands, check exit codes. And it works for simple one-off operations. But shell scripts have no built-in idempotency – run the same script twice and you may get different results depending on current state. Error handling is manual. Parallelizing across large host counts requires custom logic that itself needs testing and maintenance.
Ansible solves these problems at the architecture level. It is agentless – no daemon runs on managed nodes, no client software to install or maintain. The control node pushes Python modules over SSH (port 22 for Linux, WinRM 5985/5986 for Windows), executes them, then removes them. Nothing persists on the target host after the run. No agent version drift, no additional listening ports on every managed server (which your security team will appreciate when they run their next port audit).
The control node requires Python 3.9+ and Ansible 2.14+ on any Linux host. That single node can manage thousands of hosts using configurable fork-based parallelism.
Inventory as Your Foundation
Every Ansible run starts with an inventory – the defined list of hosts and groups you are targeting. A static inventory in INI format is the simplest starting point:
[webservers]
web01.prod.example.com
web02.prod.example.com

[dbservers]
db01.prod.example.com ansible_user=dbadmin

[prod:children]
webservers
dbservers

[all:vars]
ansible_python_interpreter=/usr/bin/python3
Groups let you target infrastructure subsets cleanly. Variables at group or host level override global defaults. The ansible_python_interpreter line matters more than it looks – RHEL 8 and Ubuntu 22.04 both changed default Python paths, and Ansible auto-discovery can select the wrong interpreter on mixed-OS environments. Set it explicitly and avoid the confusion during production runs.
Dynamic inventory goes further. Collection-based plugins for AWS, Azure, and VMware query your infrastructure directly at run time. Your host list stays current automatically when instances scale up or down – no manual inventory file edits when the fleet changes.
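A minimal sketch of what that looks like for AWS, assuming the amazon.aws collection is installed and credentials are available in the environment — the region and tag filter here are illustrative:

```yaml
# aws_ec2.yml - dynamic inventory config (filename must end in aws_ec2.yml)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  # Only pull instances tagged for production (tag name is an example)
  tag:Environment: prod
keyed_groups:
  # Build groups such as tag_Role_webserver from each instance's tags
  - prefix: tag
    key: tags
```

Running `ansible-inventory -i aws_ec2.yml --graph` shows the generated groups without touching any hosts.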
Ad Hoc Commands Versus Playbooks
Ad hoc commands handle immediate, one-off operations during live troubleshooting:
ansible webservers -m ansible.builtin.ping
ansible all -m ansible.builtin.command -a 'uptime' --become
ansible dbservers -m ansible.builtin.service -a 'name=postgresql state=restarted' --become
Fast and useful when you need an answer immediately. But ad hoc commands produce no structured record, no idempotency guarantees, and no error-handling flow. The moment a task becomes part of a recurring operational workflow, it belongs in a playbook.
Playbooks are YAML-defined task sequences with handlers, variables, conditionals, and loops. The same playbook run on Tuesday and again on Thursday produces identical end state if nothing external changed. That is idempotency – and it is what separates trustworthy automation from fast scripting.
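The difference shows up even in trivial tasks. A shell command gives Ansible no state to compare, so it reports "changed" on every run; the equivalent module call only changes anything when the target state does not already match:

```yaml
# Non-idempotent: reports "changed" on every single run
- name: Create app directory (shell version)
  ansible.builtin.shell: mkdir -p /opt/app && chmod 750 /opt/app

# Idempotent: a second run reports "ok" and touches nothing
- name: Create app directory (module version)
  ansible.builtin.file:
    path: /opt/app
    state: directory
    mode: "0750"
```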
Building a Patching Playbook That Actually Works
Here is the production patching playbook structure I use for RHEL and Rocky Linux 9 web tier hosts. The serial directive controls how many hosts run concurrently:
---
- name: Web server patching
  hosts: webservers
  become: true
  serial: 2
  vars:
    reboot_timeout: 600
  pre_tasks:
    - name: Check root partition disk usage
      ansible.builtin.shell: df --output=pcent / | tail -1 | tr -d ' %'
      register: disk_usage
      changed_when: false

    - name: Abort if disk above 85 percent used
      ansible.builtin.fail:
        msg: "Disk at {{ disk_usage.stdout }}% - aborting before patching"
      when: disk_usage.stdout | int > 85
  tasks:
    - name: Apply all available package updates
      ansible.builtin.dnf:
        name: '*'
        state: latest
        update_cache: true

    # /run/reboot-required is a Debian/Ubuntu convention and does not exist
    # on RHEL-family hosts; needs-restarting -r (from dnf-utils/yum-utils)
    # exits 1 when a reboot is required, 0 when it is not
    - name: Check whether reboot is required
      ansible.builtin.command: dnf needs-restarting -r
      register: reboot_flag
      changed_when: false
      failed_when: reboot_flag.rc > 1

    - name: Reboot if kernel or critical library updated
      ansible.builtin.reboot:
        reboot_timeout: "{{ reboot_timeout }}"
      when: reboot_flag.rc == 1
  post_tasks:
    - name: Verify critical services are running
      ansible.builtin.service:
        name: "{{ item }}"
        state: started
      loop:
        - httpd
        - firewalld
serial: 2 tells Ansible to process only two hosts at a time. For a web tier behind a load balancer, this keeps capacity available throughout the patching window. Remove it and Ansible targets all hosts simultaneously – acceptable in dev, dangerous in production.
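serial also accepts percentages and ramp-up lists, which is one way to canary a risky change — patch a single host first, watch it, then widen the batch:

```yaml
# One host first, then 25% of the group, then everything remaining
- name: Web server patching with a canary ramp-up
  hosts: webservers
  become: true
  serial:
    - 1
    - 25%
    - 100%
```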
The pre_tasks block runs before any package operations. A disk check failure aborts the play for that host before touching the package manager. Running a full dnf update on a host at 97% disk utilization can corrupt the RPM database mid-update. The pre-check costs three seconds and prevents a two-hour recovery.
If you need a dedicated system for your Ansible control node, running it on a dedicated compute instance keeps it isolated from developer workstations and provides consistent SSH connectivity to your managed fleet around the clock.
Roles for When Playbooks Get Complicated
A flat playbook file is manageable up to around 150 lines. Beyond that, Ansible roles provide a directory structure that separates tasks, handlers, templates, variables, and defaults:
roles/
  webserver/
    tasks/main.yml
    handlers/main.yml
    templates/nginx.conf.j2
    defaults/main.yml
    vars/main.yml
    files/
Roles are self-contained and reusable. You can test a role independently, version it separately from the calling playbook, and reference it across multiple playbooks. Ansible Galaxy hosts thousands of community roles – though I audit every external role before running it in production. A role you have not read is a playbook you do not understand.
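Consuming a role from a playbook is a one-liner, and per-play variables override the role's defaults — the variable name below is hypothetical, standing in for whatever roles/webserver/defaults/main.yml defines:

```yaml
# site.yml - calling the webserver role with an overridden default
- name: Configure web tier
  hosts: webservers
  become: true
  roles:
    - role: webserver
      vars:
        nginx_worker_processes: 4
```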
Jinja2 templates (the .j2 files) are where configuration management earns its value. One nginx.conf template renders differently per host based on inventory variables – worker count, upstream pool members, SSL certificate paths. No more maintaining separate per-host config files or manually diffing between them when something diverges.
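A sketch of what such a template fragment might look like — the variable names (worker_count, upstream_hosts) are illustrative, not from any real inventory:

```jinja
{# templates/nginx.conf.j2 - renders differently per host from inventory vars #}
worker_processes {{ worker_count | default(2) }};

upstream app_pool {
{% for host in upstream_hosts %}
    server {{ host }}:8080;
{% endfor %}
}
```

The `ansible.builtin.template` module renders this per host at run time, so one file replaces a directory of hand-maintained per-host configs.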
Protecting Secrets With Ansible Vault
Hardcoded passwords in playbooks end up in git history. Ansible Vault encrypts variable values or entire files using AES-256. Encrypting an inline string:
ansible-vault encrypt_string 'db_pass_here' --name 'db_password'
The output pastes directly into your vars file:
db_password: !vault |
$ANSIBLE_VAULT;1.1;AES256
66386134653765386265663232303932...
The vault password file lives outside the project directory and is never committed to version control. Add no_log: true to any task that references credentials – Ansible verbose output and log files will otherwise print variable values in plaintext.
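In practice that looks like the following — the module here is just an example of a task that handles a credential; any module referencing db_password needs the same treatment:

```yaml
# Without no_log, -vvv output and log files would print the decrypted value
- name: Create application database user
  community.postgresql.postgresql_user:
    name: appuser
    password: "{{ db_password }}"
  no_log: true
```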
For environments with frequent secret rotation, the community.hashi_vault collection integrates directly with HashiCorp Vault. This works well for organizations with centralized secrets infrastructure but adds operational complexity that smaller environments do not need.
Validate Your Backups Before Any Patching Run
This sounds obvious. In practice, teams skip it under deadline pressure and regret it afterward. Before running any playbook that modifies system state at scale, verify that your most recent snapshot or backup job completed successfully. Keeping backups in a separate location – isolated from the hosts being patched and from your primary production infrastructure – is the only way to recover cleanly when a patching run goes wrong.
Write an actual pre-playbook check that queries your backup system and fails hard if no successful backup exists within the past 24 hours. That check belongs in pre_tasks, not in a README that nobody reads under pressure.
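One possible shape for that check, assuming a backup system with an HTTP API — the endpoint URL and response fields below are placeholders for whatever your backup tool actually exposes:

```yaml
pre_tasks:
  - name: Query backup system for the latest completed job (endpoint is a placeholder)
    ansible.builtin.uri:
      url: "https://backup.example.com/api/v1/jobs/latest?host={{ inventory_hostname }}"
      return_content: true
    register: backup_status
    delegate_to: localhost

  - name: Fail hard if no successful backup within the past 24 hours
    ansible.builtin.fail:
      msg: "No successful backup within 24h for {{ inventory_hostname }} - refusing to patch"
    when: >
      backup_status.json.status != 'success' or
      (now().timestamp() - backup_status.json.completed_at) > 86400
```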
Where Ansible Has Real Limitations
Speed is the honest trade-off. Ansible executes tasks over SSH, processing hosts in parallel only up to the configured fork count. On a fleet of 500 servers with forks: 50, a full patching run takes longer than agent-based tools like Puppet or SaltStack. For pure throughput on massive node counts, Ansible is not the fastest available option.
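The fork count lives in ansible.cfg; the shipped default is a conservative 5, so raising it is usually the first tuning step on larger fleets:

```ini
# ansible.cfg - process up to 50 hosts concurrently instead of the default 5
[defaults]
forks = 50
```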
YAML indentation errors are operationally fragile in ways that do not always fail loudly. Two extra spaces can silently change task scope. Run ansible-lint on every playbook before committing – it catches indentation issues, deprecated module usage, and risky patterns like bare shell commands where proper modules exist. This works well with a CI pipeline in place but adds friction for teams writing and running playbooks without a review step in between.
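A minimal sketch of gating merges on the linter, using a GitHub Actions workflow as one example of a CI step — the playbook path is an assumption:

```yaml
# .github/workflows/lint.yml - fail the pull request if ansible-lint objects
name: lint
on: [pull_request]
jobs:
  ansible-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ansible-lint
      - run: ansible-lint playbooks/
```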
Windows support exists via WinRM but requires substantially more setup than SSH, and the debugging experience for WinRM connectivity problems is considerably worse than SSH troubleshooting. If more than 40% of your managed nodes run Windows, evaluate whether Ansible is the right primary tool before committing to it for fleet-wide automation.
For teams running Ansible alongside Kubernetes workloads, the kubernetes.core collection extends automation into cluster management – Kubernetes Cluster Management Essentials for IT Ops covers when cluster automation makes sense and when it introduces more complexity than it solves. If you are comparing Ansible to PowerShell-based automation for Windows-heavy environments, the readiness framework in PowerShell Automation Readiness Audit for IT Teams is worth reading before committing to either tool.
The Honest Assessment
My position: Ansible is the right first automation tool for any team managing Linux infrastructure at scale. The YAML syntax is readable without specialized training. The module library covers package management, service control, file operations, user accounts, network device configuration, cloud provisioning, and container management. The community is large enough that most problems you encounter already have documented solutions and tested examples available.
But badly written playbooks are just slow, fragile YAML that still requires manual intervention to complete. Idempotency is not automatic – it requires using the correct modules instead of raw shell commands, registering task output and acting on it conditionally, and testing against real infrastructure in a staging environment before any production run.
The architectural lessons in Azure Cloud Infrastructure: Hard Lessons from a Real Outage apply directly to automation design decisions around state management, rollback procedures, and what happens when automation runs against infrastructure in an unexpected condition.
Start with a single playbook that does one thing correctly. Use --check mode to preview what would change. Use --diff to see exact content differences before committing to a run. Move to roles only when a playbook exceeds 150 lines or needs reuse across projects. Add complexity only when the simpler version is proven operationally solid.
The patch windows are boring now. That is exactly what you want.
If you want a structured assessment of where automation fits in your current environment – what to automate first, which tooling matches your infrastructure profile, and where the risk points are – contact the SSE team at https://clients.sse.to/contact.php.
Related reading: VMware vSphere Administration Tips That Save Time | Kubernetes Cluster Management Essentials for IT Ops | How to Manage DNS and DHCP: IT Admin Complete Guide
