Three nodes went NotReady at 2 AM on a Friday. Our on-call engineer spent two hours clicking through dashboards, running kubectl commands one at a time, and cross-referencing logs by hand. No runbooks. No automation. Just a very tired human and a very angry cluster.

That incident is what pushed me to build a real Kubernetes management process. If you're managing clusters reactively right now, this guide walks through the exact steps to change that.
The Manual Management Problem
Most teams get into trouble the same way. They spin up a cluster, get workloads running, and treat Kubernetes as a black box until something breaks.
Manual cluster management doesn’t scale. One node is manageable. Ten nodes across three namespaces with different teams deploying daily? You need automation, not heroics. The problem isn’t that things break – it’s that without a baseline process, every incident starts from zero.
Why Clusters Drift and Degrade
Kubernetes clusters don’t fail suddenly. They drift. Nodes accumulate resource pressure. etcd grows unchecked. Someone deploys a DaemonSet with no resource limits and it slowly eats CPU across every node in the cluster.
The three root causes I see most often:
- No namespace resource quotas – Workloads compete without limits until something gets starved
- No etcd backup process – Your cluster’s entire state lives in etcd, and most teams treat it like it’s indestructible
- No consistent health monitoring – Problems compound quietly until they become visible outages
Here’s my honest take: most Kubernetes outages are operational failures, not platform failures. Kubernetes is excellent at orchestrating containers. The gaps are almost always in how teams manage and observe it day-to-day.
Step 1 – Build a Cluster Health Baseline
Before you fix anything, know what normal looks like. Here’s the script progression I use for cluster health checks. (I run a version of this every morning. It takes 30 seconds and has caught problems before users noticed them.)
Version 1 – The Quick and Dirty Check
Start ugly. Get it working.
#!/bin/bash
# v1 - quick cluster check, no frills
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl top nodes
Ugly but works. You can see node state, anything not running, and resource consumption at a glance. The problem is there’s no logging, no thresholds, and no alerting. You’re just eyeballing output and hoping you notice something important.
Version 2 – Add Logging and Context
#!/bin/bash
# v2 - cluster health check with dated logs
LOGFILE="/var/log/k8s-health-$(date +%Y%m%d).log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "=== Cluster Health Check: $TIMESTAMP ===" | tee -a "$LOGFILE"
echo "--- Node Status ---" | tee -a "$LOGFILE"
kubectl get nodes -o wide | tee -a "$LOGFILE"
echo "--- Problem Pods ---" | tee -a "$LOGFILE"
kubectl get pods --all-namespaces \
  --field-selector=status.phase!=Running,status.phase!=Succeeded \
  | tee -a "$LOGFILE"
echo "--- Node Resource Usage ---" | tee -a "$LOGFILE"
kubectl top nodes | tee -a "$LOGFILE"
echo "--- Top 10 Pods by CPU ---" | tee -a "$LOGFILE"
kubectl top pods --all-namespaces --sort-by=cpu | head -10 | tee -a "$LOGFILE"
Much better. Dated log files mean you can compare today’s state against last Tuesday. But you’re still running this manually, which means it only runs when someone remembers to run it.
Version 3 – Production-Ready with Alerting
#!/bin/bash
# v3 - production cluster health check
# Schedule via cron: */15 * * * * /usr/local/bin/k8s-health.sh
LOGFILE="/var/log/k8s-health-$(date +%Y%m%d).log"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"
ALERT=false
ALERT_MSG=""
check_nodes() {
  NOT_READY=$(kubectl get nodes --no-headers | grep -v " Ready" | wc -l)
  if [ "$NOT_READY" -gt 0 ]; then
    ALERT=true
    ALERT_MSG+="CRITICAL: $NOT_READY node(s) not ready\n"
    kubectl get nodes | grep -v " Ready" >> "$LOGFILE"
  fi
}
check_pods() {
  PROBLEM_PODS=$(kubectl get pods --all-namespaces --no-headers \
    --field-selector=status.phase!=Running,status.phase!=Succeeded \
    | grep -v 'Completed' | wc -l)
  if [ "$PROBLEM_PODS" -gt 5 ]; then
    ALERT=true
    ALERT_MSG+="WARNING: $PROBLEM_PODS pod(s) in problem state\n"
  fi
}
send_alert() {
  if [ "$ALERT" = true ] && [ -n "$SLACK_WEBHOOK" ]; then
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      --data "{\"text\":\"K8s Cluster Alert:\\n$ALERT_MSG\"}"
  fi
}
# Append directly to the log instead of piping through tee: a pipeline runs
# the checks in a subshell, so ALERT set there would never reach send_alert.
{
  echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
  check_nodes
  check_pods
  kubectl top nodes
} >> "$LOGFILE"
send_alert
Now this is useful. Schedule it every 15 minutes via cron, and your cluster health checks run whether anyone remembers to run them or not. Slack gets a message when thresholds are breached.
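One cron detail that bites people: cron runs jobs with a minimal environment, so the `SLACK_WEBHOOK_URL` variable the script reads won't exist unless you set it in the crontab itself. A minimal entry (the file path and webhook value are placeholders):

```shell
# /etc/cron.d/k8s-health - webhook value below is a placeholder, use your own
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
*/15 * * * * root /usr/local/bin/k8s-health.sh
```

System crontabs under /etc/cron.d take a user field (`root` here); a per-user `crontab -e` entry omits it.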
Step 2 – Automate etcd Backups
etcd holds every piece of cluster state – deployments, services, ConfigMaps, secrets, RBAC policies. All of it. If you lose etcd without a backup, the cluster is gone. And I mean everything is gone.
Some teams assume that running it on dedicated hardware with physical redundancy is enough protection. It isn’t. Hardware RAID keeps the disk alive. It doesn’t protect you from a corrupted etcd write or an accidental kubectl delete on the wrong namespace.
Here’s the backup script that should be running daily on every control plane node (tested against Kubernetes 1.29):
#!/bin/bash
# etcd snapshot backup - run daily via cron
BACKUP_DIR="/etc/etcd/backups"
SNAPSHOT_NAME="etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
mkdir -p "$BACKUP_DIR"
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/$SNAPSHOT_NAME" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Always verify the snapshot after creation
ETCDCTL_API=3 etcdctl snapshot status "$BACKUP_DIR/$SNAPSHOT_NAME" \
  --write-out=table
# Rotate: keep only the last 7 days
find "$BACKUP_DIR" -name '*.db' -mtime +7 -delete
echo "Snapshot complete: $SNAPSHOT_NAME"
The verification step matters. A backup you’ve never tested restoring is a hope, not a backup. Run etcdctl snapshot status every time and check the hash is valid.
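Beyond checking the hash, you can rehearse an actual restore without touching the live cluster by restoring the snapshot into a scratch directory. A sketch, assuming the backup directory from the script above (the `/tmp` target path is a placeholder):

```shell
#!/bin/bash
# Restore drill: rebuild an etcd data dir from the newest snapshot in a
# scratch location. /tmp/etcd-restore-test is a placeholder, not a real data dir.
LATEST=$(ls -t /etc/etcd/backups/*.db | head -1)
ETCDCTL_API=3 etcdctl snapshot restore "$LATEST" \
  --data-dir=/tmp/etcd-restore-test
# A zero exit code means the snapshot is structurally sound; clean up after.
rm -rf /tmp/etcd-restore-test
```

On recent etcd releases the restore subcommand has moved to `etcdutl snapshot restore`; `etcdctl` still accepts it with a deprecation warning.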
Step 3 – Enforce Resource Quotas per Namespace
This single change has prevented more cluster-wide incidents than anything else I’ve implemented. Without quotas, one team’s runaway deployment can starve every other workload on the cluster.
Apply a ResourceQuota to every namespace that teams deploy into:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    persistentvolumeclaims: "10"
kubectl apply -f team-quota.yaml
kubectl describe resourcequota team-quota -n team-a
This works well for stable multi-tenant clusters but gets complicated when teams have wildly different workload profiles – a team running batch jobs needs very different limits than a team running real-time APIs. Start with generous limits and tighten based on actual usage data from kubectl top, not guesswork.
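To drive that tightening with data rather than guesswork, a small loop that prints used-versus-hard figures for every namespace carrying a quota works well. A sketch, assuming quotas like the one above are already applied:

```shell
#!/bin/bash
# Report quota consumption vs. hard limits for every namespace with a quota.
for ns in $(kubectl get ns --no-headers -o custom-columns=':metadata.name'); do
  # Skip namespaces that have no ResourceQuota objects at all
  if kubectl get resourcequota -n "$ns" --no-headers 2>/dev/null | grep -q .; then
    echo "=== $ns ==="
    # describe prints a readable Used/Hard table per resource
    kubectl describe resourcequota -n "$ns" \
      | grep -E 'Name:|requests\.|limits\.|pods'
  fi
done
```

Run it monthly alongside kubectl top and lower any quota whose usage sits far below its limit.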
Verifying Cluster Health After Changes
Every change to the cluster – new deployments, quota updates, node additions, Helm releases – should be followed by a verification pass. Here’s the post-change checklist I run as a script:
#!/bin/bash
# Post-change cluster verification
echo "=== Cluster State Verification ==="
echo "Node Status:"
kubectl get nodes -o wide
echo ""
echo "System Pods (kube-system):"
kubectl get pods -n kube-system
echo ""
echo "Recent Events (last 20):"
kubectl get events --all-namespaces \
  --sort-by='.lastTimestamp' \
  | tail -20
echo ""
echo "etcd Endpoint Health:"
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
The kubectl get events output is the most underused diagnostic tool in the Kubernetes toolbox. Events show scheduling decisions, image pull failures, OOMKilled restarts, and quota rejections in real time. Look at events before you look at pod logs.
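When you only care about problems, filter the event stream server-side instead of scrolling past the Normal entries:

```shell
# Only Warning-type events, newest last: failed pulls, OOMKills, quota rejections
kubectl get events --all-namespaces \
  --field-selector type=Warning \
  --sort-by='.lastTimestamp' | tail -20
```

Events age out after roughly an hour by default, so a check run right after a change catches things a check run the next morning will miss.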
Prevention – Build Your Ongoing Management Loop
Monitor Node Conditions Proactively
Kubernetes tracks node conditions continuously: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable. Poll these on a schedule rather than waiting for pods to start failing:
# Check for any non-Ready conditions across all nodes
kubectl get nodes -o json | \
  jq -r '.items[] | .metadata.name as $n | .status.conditions[]
    | select(.type != "Ready" and .status == "True")
    | "Node: " + $n + " | Condition: " + .type'
A node showing DiskPressure: True is telling you it’s about to start evicting pods. Catching that before eviction happens is the difference between a maintenance window and a midnight incident.
Run Monthly Namespace Audits
Abandoned resources accumulate. Deployments with zero replicas. Services pointing to nothing. PVCs consuming storage for workloads that no longer exist. A monthly loop catches these before they become noise in your billing or your metrics:
#!/bin/bash
# Monthly namespace resource audit
for ns in $(kubectl get namespaces --no-headers -o custom-columns=':metadata.name'); do
  echo "=== Namespace: $ns ==="
  echo "Deployments with 0 replicas:"
  kubectl get deployments -n "$ns" -o json | \
    jq -r '.items[] | select(.spec.replicas == 0) | .metadata.name'
  echo "Unbound PVCs:"
  kubectl get pvc -n "$ns" --no-headers | grep -v Bound
done
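A check I'd consider adding to the same loop: Services whose selectors no longer match any pods, which show up as Endpoints objects with no addresses. A sketch:

```shell
#!/bin/bash
# Flag Services that currently route to zero pods (empty endpoints).
for ns in $(kubectl get ns --no-headers -o custom-columns=':metadata.name'); do
  kubectl get endpoints -n "$ns" -o json | \
    jq -r --arg ns "$ns" \
      '.items[] | select((.subsets // []) | length == 0)
        | $ns + "/" + .metadata.name'
done
```

Expect some noise here: headless Services and Services backing jobs that run intermittently will also appear, so treat the output as a review list, not a delete list.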
Standardize Deployments with Helm 3
If teams are deploying raw YAML manifests without a package manager, you’re creating snowflake clusters. Helm 3.x gives you versioned, templated, rollback-capable deployments. Rollbacks should be boring. A boring rollback means your process worked.
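The day-to-day commands that make rollbacks boring are `helm history` and `helm rollback`. The release name and namespace below are placeholders:

```shell
# Inspect the release's revision history, then roll back to a known-good one
helm history my-app -n team-a      # "my-app" and "team-a" are placeholders
helm rollback my-app 3 -n team-a   # roll back to revision 3
helm status my-app -n team-a       # confirm the rollback landed
```

Follow any rollback with the post-change verification script from the previous section.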
Pair your etcd snapshots with a solid offsite backup strategy – snapshots sitting only on the control plane node don’t protect you from the scenario where that node is the thing that failed. Move backups off the cluster entirely, on a schedule.
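A minimal offsite step, assuming an S3-compatible bucket (the bucket name is a placeholder) and the AWS CLI installed on the control plane node:

```shell
#!/bin/bash
# Ship local etcd snapshots offsite after each backup run.
# "s3://my-cluster-backups" is a placeholder; use your own bucket.
BACKUP_DIR="/etc/etcd/backups"
aws s3 sync "$BACKUP_DIR" "s3://my-cluster-backups/etcd/" \
  --exclude '*' --include '*.db'
```

Append this to the daily backup script so the offsite copy happens on the same schedule as the snapshot itself.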
Audit RBAC Quarterly
RBAC in Kubernetes is powerful and frequently misconfigured. Cluster-admin bindings handed out casually accumulate over time. Run quarterly audits of ClusterRoleBindings and RoleBindings across all namespaces and remove anything that isn’t documented and justified. The patterns in IT Auditing in the Age of AI, IoT, and Zero Trust apply directly to Kubernetes access control reviews – the principles are the same even if the tooling is different.
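A starting point for the quarterly review: list every subject bound to cluster-admin, since those are the bindings that accumulate. A sketch:

```shell
# Who holds cluster-admin? Prints binding name plus subject kind/name.
kubectl get clusterrolebindings -o json | \
  jq -r '.items[]
    | select(.roleRef.name == "cluster-admin")
    | .metadata.name as $b
    | .subjects[]? | $b + ": " + .kind + "/" + .name'
```

Anything in that output that isn't documented and justified is a candidate for removal.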
For teams placing Kubernetes within broader cloud strategy decisions, the XaaS Cloud Service Models: Security Guide for IT Teams covers how container orchestration fits into IaaS, PaaS, and SaaS responsibility boundaries – useful context when deciding what your team actually owns. And if your organization is still working through foundational cloud adoption challenges, Challenges in Cloud Computing: Security, Data Management & How to Overcome Them covers the data management and security decisions that sit underneath every cluster management choice.
Stop Reacting and Start Managing
The v1 health check script I showed above took me about 10 minutes to write. The v3 version came from six months of iteration after real incidents – each one teaching me what we weren’t watching. Don’t wait for an incident to build your process. Start with the quick and dirty version and improve it.
But here’s what matters: the script itself isn’t the point. The habit of running it is. Consistent, automated observation is what separates teams that manage clusters from teams that survive them.
If your team needs help building a Kubernetes management framework or wants an expert review of your current cluster setup, reach out to the SSE team. We work with IT operations teams to build the operational foundations that keep infrastructure stable and manageable as workloads grow.
Related reading: How to Manage DNS and DHCP: IT Admin Complete Guide | GitHub Discussions: Community Engagement for IT Teams | Citrix NetScaler: Complete Admin Guide for IT Teams
