The 3 AM Page That Started This Checklist
A logistics client paged us at 3 AM because their order API was returning 502s. The fix took ninety seconds once we logged in: find the runaway Python worker with ps aux | grep, kill it, and watch systemctl restart the unit cleanly. The post-mortem took longer than the fix.
That incident is why I wrote this Linux process management checklist. Every SRE I have hired in the last decade thinks they know ps, top, kill, and systemctl. Most of them know about sixty percent of what they need. The other forty percent is what bites you at 3 AM.
This is the audit I run on every new managed environment we onboard. Eight checkpoints, pass/fail criteria, and the commands that prove each one. Run it on a server you think you know. You will find at least one thing wrong.
Checkpoint 1: Can You Enumerate Every Running Process?
The baseline. If you cannot list every process with its PID, user, CPU, memory, and full command line, you are flying blind.
Command: ps aux
That gives you the full snapshot. PID, %CPU, %MEM, start time, and the command invocation. For anything more targeted, pipe it. To hunt for SSH sessions: ps aux | grep ssh. To filter by user: ps -u deploy. For a parent-child tree view: ps -ejH or ps auxf.
Pass criteria: You can identify every process owner, the parent, and the full command line including arguments. If you see [kworker]-style names you do not recognize, that is fine. Those are kernel threads, which ps shows in brackets.
Fail criteria: Processes running as root that nobody on the team can account for, or whose parent you cannot explain. Long-running processes with random base64 in the command line. That is your tell for crypto miners or shells dropped through an exploit. We spotted exactly that during a routine check of a client’s event logs last year: an nginx-named binary running out of /tmp. Never a good sign.
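The /tmp finding above can be turned into a quick scan. This is a minimal sketch, not a detection tool: classify_exe and scan_processes are hypothetical helper names, and the list of suspicious directories is an assumption you should adjust for your environment.

```shell
# Sketch: flag processes whose executable path looks suspicious.
# classify_exe and scan_processes are illustrative names, not standard tools.

# Print "suspicious" for executables in world-writable temp directories
# or already unlinked from disk, "ok" otherwise.
classify_exe() {
    case "$1" in
        /tmp/*|/var/tmp/*|/dev/shm/*) echo "suspicious" ;;
        *" (deleted)")                echo "suspicious" ;;
        *)                            echo "ok" ;;
    esac
}

# Walk /proc and print PID, owner, and executable path for flagged processes.
# Kernel threads have no readable exe link and are skipped.
scan_processes() {
    local dir pid exe
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        exe=$(readlink "$dir/exe" 2>/dev/null) || continue
        [ -n "$exe" ] || continue
        if [ "$(classify_exe "$exe")" = "suspicious" ]; then
            printf '%s\t%s\t%s\n' "$pid" "$(stat -c %U "$dir" 2>/dev/null)" "$exe"
        fi
    done
}
```

Run scan_processes as root so every exe link is readable. Expect some noise: a package upgrade leaves running daemons pointing at a (deleted) binary until they are restarted, so treat hits as questions to answer, not verdicts.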
Checkpoint 2: Do You Have Real-Time Visibility?
Static snapshots from ps tell you what is running. Real-time tools tell you what is hurting.
Command: top
Inside top, the keys that matter are P to sort by CPU, M to sort by memory, and k to kill a process by PID without leaving the interface. That last one saves you forty seconds during an outage.
I install htop on every server we manage. Run sudo apt install htop on Debian-family or sudo dnf install htop on RHEL-family (on RHEL and its rebuilds, htop comes from the EPEL repository, so enable that first). You get colored output, mouse selection, per-core CPU bars, and a tree view toggle. The mental load is lower at 3 AM, and lower mental load means fewer mistakes.
Pass criteria: A real-time monitor is installed and your runbook references it by name. The on-call engineer does not have to think.
Fail criteria: Your only option is top with no documented workflow. Worse: top is aliased to something weird in someone’s dotfiles that nobody else can read.
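A documented workflow can be as small as one function the runbook names. The sketch below is an assumption about what such a helper might look like (top_consumers is a made-up name); it relies on the procps version of ps for the --sort option.

```shell
# top_consumers [COUNT] [SORT_KEY]
# Print the header plus the COUNT heaviest processes, sorted by SORT_KEY
# (%cpu by default, %mem is the other useful choice).
top_consumers() {
    local count=${1:-10} key=${2:-%cpu}
    ps -eo pid,user,%cpu,%mem,etime,args --sort="-${key}" | head -n "$((count + 1))"
}
```

Call top_consumers 5 %mem to see the five largest memory users. Unlike interactive top, the output can be pasted straight into an incident ticket. If you prefer top itself, top -b -n 1 produces a single batch-mode snapshot.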
Checkpoint 3: Can You Terminate Processes Without Collateral Damage?
This is where engineers panic and reach for kill -9 immediately. Do not.
The signal hierarchy you should actually use:
kill 1234 sends SIGTERM (signal 15). Politely asks the process to clean up and exit. Most well-behaved daemons flush buffers and close connections.
kill -9 1234 sends SIGKILL. The process gets no chance to clean up. Use this only when SIGTERM has been ignored for at least ten seconds.
pkill firefox kills by process name.
pkill -f python_script.py matches against the full command line, which is what you want when you have ten Python workers and only need to kill one.
killall chrome terminates every process matching the name.
The blast radius of kill -9 on a database process can be lost or corrupted data. We had a client lose three hours of writes on a Postgres instance because their previous contractor reflexively SIGKILLed the postmaster during a slow query. Do not be that contractor. For environments where this kind of operational discipline matters, our IT outsourcing team writes the runbooks so the on-call engineer is not improvising.
Pass criteria: Your runbooks specify SIGTERM first, wait, then SIGKILL. Database processes are explicitly excluded from any automated kill scripts.
Fail criteria: Anyone on the team uses kill -9 as their default. Or worse, you have a cron job that pkill -9s anything matching a pattern. That is a foot-gun on a timer.
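The SIGTERM-first rule is easier to follow when it is a function and not a habit. Here is a minimal sketch under stated assumptions: graceful_stop and is_running are hypothetical names, and the ten-second default mirrors the waiting period suggested above.

```shell
# True while PID exists and is not a zombie (a zombie has already exited).
is_running() {
    local state
    state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
    case "$state" in
        ''|Z*) return 1 ;;
        *)     return 0 ;;
    esac
}

# graceful_stop PID [TIMEOUT_SECONDS]
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed,
# 2 if the process is alive but could not be signalled.
graceful_stop() {
    local pid=$1 timeout=${2:-10} waited=0
    if ! kill -TERM "$pid" 2>/dev/null; then
        is_running "$pid" && return 2   # alive, but we may not signal it
        return 0                        # already gone
    fi
    while is_running "$pid"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid ignored SIGTERM for ${timeout}s, sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A return code of 1 is worth logging: it tells you which services routinely ignore SIGTERM. Keep database PIDs out of anything that calls this automatically.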
Checkpoint 4: Is Every Long-Running Service Managed by systemd?
If a service matters, it lives under systemd. No exceptions, no nohup in screen sessions, no PM2 wrappers babysat by a cron job.
On most modern Linux distributions, systemd is PID 1 from the moment the system boots. It starts everything else. That makes systemctl your single source of truth for service state.
The commands that actually matter:
systemctl status nginx shows current state, recent log lines, and the PID.
sudo systemctl restart nginx stops and starts cleanly.
sudo systemctl enable nginx ensures the unit starts at boot. The number of “production” services I have found that were not enabled would embarrass you.
sudo systemctl disable nginx removes the boot link.
systemctl list-units --type=service --state=running gives you the full inventory.
Pass criteria: Every business-critical service has a unit file, is enabled, and survives a reboot. You have tested this with an actual reboot, not just by reading the output.
Fail criteria: A service is running because someone nohuped it six months ago and nobody has rebooted since. kill stops the current process, but without a unit file or init script, it will not come back after reboot. That is the trap.
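The enabled-and-running check can be scripted for a list of critical units. This is a sketch, assuming you maintain that list yourself; audit_services is a hypothetical name, and it treats only the literal state "enabled" as a pass.

```shell
# audit_services UNIT...
# Print PASS/FAIL per unit; exit non-zero if any unit is not both
# enabled at boot and active right now.
audit_services() {
    local unit enabled active failures=0
    for unit in "$@"; do
        enabled=$(systemctl is-enabled "$unit" 2>/dev/null)
        active=$(systemctl is-active "$unit" 2>/dev/null)
        if [ "$enabled" = "enabled" ] && [ "$active" = "active" ]; then
            echo "PASS  $unit"
        else
            echo "FAIL  $unit (enabled=${enabled:-unknown}, active=${active:-unknown})"
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}
```

Example: audit_services nginx.service postgresql.service. Units that report "static" are started only as dependencies of other units, so a FAIL there needs a human decision, not an automatic enable. None of this replaces the actual reboot test.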
Checkpoint 5: Are You Reading the Right Logs?
systemd ships with journalctl, which is more powerful than people give it credit for.
Commands worth memorizing:
journalctl -u nginx -f follows logs for one unit in real time.
journalctl --since "1 hour ago" --priority=warning filters by time window and severity.
journalctl -u postgresql --since today --until "2 hours ago" bounds your search instead of paging through gigabytes.
journalctl -p err -b shows error-level entries from the current boot.
Pass criteria: Your incident response runbook references journalctl with specific filters. Nobody is grepping /var/log/syslog by hand at 2 AM.
Fail criteria: You forwarded the systemd journal to syslog years ago, never tested it, and the forwarder silently died last Tuesday. Test your logging path quarterly. Treat it like a ransomware protection backup: untested means broken.
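Testing the logging path means writing a marker and proving it arrives at both ends. The sketch below assumes a Debian-style /var/log/syslog as the forwarding target (RHEL-family hosts typically use /var/log/messages); verify_log_path and the logpath-test tag are made-up names.

```shell
# verify_log_path [SYSLOG_FILE]
# Write a unique marker with logger, then confirm it shows up in the
# journal and in the forwarded syslog file.
verify_log_path() {
    local syslog_file=${1:-/var/log/syslog}
    local marker="logpath-test-$(date +%s)-$$"
    logger -t logpath-test "$marker"
    sleep 2   # give journald and the forwarder a moment to write
    if journalctl -t logpath-test --since "1 minute ago" --no-pager | grep -q "$marker"; then
        echo "journal: ok"
    else
        echo "journal: MISSING"
        return 1
    fi
    if grep -q "$marker" "$syslog_file" 2>/dev/null; then
        echo "syslog:  ok"
    else
        echo "syslog:  MISSING"
        return 1
    fi
}
```

Run it with enough privilege to read the syslog file. If logs also ship to a remote collector, search for the same marker there; that last hop is the one this sketch does not cover.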
Checkpoint 6: Do You Know What Owns Each Network Socket?
Process management is incomplete without network context.
Command: lsof -i :80
That tells you exactly which process is listening on port 80. To list every file a process has open: lsof -p 1234. To get the full network picture: ss -tulnp shows TCP and UDP listeners with the owning process. Run these with sudo, because without root you only see process details for your own processes.
We saw why this matters during a routine review of a healthcare client environment: an unfamiliar process was listening on port 31337. lsof -i :31337 traced it to a binary in /var/tmp that showed up as (deleted), the classic indicator of a process running from a file already unlinked from the filesystem. It is the kind of finding you catch in five minutes if you check, and miss for six months if you do not. The MITRE ATT&CK framework catalogs this as T1070.004, Indicator Removal: File Deletion.
Pass criteria: You can produce a process-to-port mapping in under thirty seconds.
Fail criteria: Unidentified listeners on non-standard ports. Processes bound to 0.0.0.0 that should only listen on 127.0.0.1.
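The 0.0.0.0 check is a one-line filter over ss output. This sketch assumes the column layout that ss -tulnpH prints (local address in the fifth column, process details in the seventh); wildcard_listeners is a hypothetical name.

```shell
# wildcard_listeners
# Read `ss -tulnpH` output on stdin and print protocol, local address,
# and owning process for every listener bound to all interfaces.
wildcard_listeners() {
    awk '$5 ~ /^(0\.0\.0\.0|\*|\[::\]):[0-9]+$/ { print $1, $5, $7 }'
}
```

Usage: sudo ss -tulnpH | wildcard_listeners. Every line in the result should map to a service that is meant to be reachable from other hosts. Anything else belongs on 127.0.0.1 or behind a firewall rule.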
Checkpoint 7: Is Process Auto-Recovery Wired Up?
Toil hates you. Eliminate it.
The pattern from the SRE book: never do something twice that a script could do once. The simplest version:
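# Check whether myservice is running; restart it if not.
# -x matches the exact process name, and the PID output is discarded so cron stays quiet.
pgrep -x myservice >/dev/null || systemctl restart myservice
Drop that in a cron job that runs every minute, and you have a poor man’s watchdog. Better: configure Restart=on-failure and RestartSec=5 in the [Service] section of the systemd unit file, and StartLimitBurst=3 in the [Unit] section (where current systemd versions expect it). systemd then handles the supervision, and you do not need cron at all.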
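For the systemd route, a drop-in file keeps the restart policy separate from the packaged unit. The following is a sketch: restart_dropin is a hypothetical helper, myservice is the placeholder name used above, and the 60-second StartLimitIntervalSec is an assumed value chosen so that three restarts spaced five seconds apart can actually hit the limit.

```shell
# restart_dropin
# Print a systemd drop-in that restarts a service on failure.
# Install (as root), using the placeholder unit name myservice:
#   mkdir -p /etc/systemd/system/myservice.service.d
#   restart_dropin > /etc/systemd/system/myservice.service.d/restart.conf
#   systemctl daemon-reload
restart_dropin() {
    cat <<'EOF'
[Unit]
# Give up after 3 start attempts within 60 seconds (interval is an assumption)
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=5
EOF
}
```

To prove recovery works, run sudo systemctl kill -s SIGKILL myservice and watch systemctl status myservice. A plain kill will not do for this test: systemd counts termination by SIGTERM as a clean exit, so Restart=on-failure does not fire.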
For larger environments, we use Ansible to push consistent systemd unit files across the fleet. One source of truth, one playbook to enforce it. If you want to see how this fits with other automation patterns, the PowerShell REST API guide covers the same philosophy from the Windows side.
Pass criteria: Critical services have Restart=on-failure and tested recovery behavior. You have actually killed the process and watched it come back.
Fail criteria: Recovery exists only in someone’s head. The runbook says “restart it” without specifying how.
Checkpoint 8: Do You Have an Inventory and Baseline?
The honest take: you cannot detect anomalies without a baseline.
For every server we manage, we capture ps auxf, systemctl list-units --type=service, and ss -tulnp output weekly. Store it next to the configuration backups, ideally on dedicated NAS backup storage so the audit trail outlives the host. Diff this week against last week. New processes are either expected (a deploy) or suspicious. Either way, you want to know within seven days.
This is also how we catch software drift. The client running an undocumented Java process for two years before we got involved? That was caught by the first baseline diff after onboarding.
Pass criteria: Weekly baseline captured, diffed, and reviewed. Findings go into a ticket.
Fail criteria: No baseline exists. You only look at process state when something is already broken.
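A weekly baseline only needs two small functions. This is a sketch, not our production tooling: capture_baseline and new_commands are hypothetical names, and the extra commands.txt file is an assumption added because raw ps auxf output changes on every run (PIDs, CPU, start times) and diffs poorly.

```shell
# capture_baseline DIR
# Save one snapshot of process, service, and socket state into DIR.
capture_baseline() {
    local dir=$1
    mkdir -p "$dir" || return 1
    ps auxf                                        > "$dir/ps.txt"
    systemctl list-units --type=service --no-pager > "$dir/services.txt" 2>&1
    ss -tulnp                                      > "$dir/sockets.txt" 2>&1
    # Stable view for diffing: owner and command line only, sorted and deduplicated
    ps -eo user=,args= | sort -u                   > "$dir/commands.txt"
}

# new_commands OLD_DIR NEW_DIR
# Print command lines that appear only in the newer snapshot.
new_commands() {
    comm -13 "$1/commands.txt" "$2/commands.txt"
}
```

Run capture_baseline as root into a dated directory, for example /backup/baselines/$(date +%F) (the path is illustrative), then pass last week's and this week's directories to new_commands. Kernel worker threads change names between runs, so expect to filter bracketed entries before opening a ticket.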
The Caveat Nobody Wants to Hear
None of this replaces a real monitoring stack. Tools like Netdata, Prometheus with node_exporter, or Datadog give you historical metrics and alerting that a shell session cannot. This checklist is what you run on day one of an engagement and what you fall back to when the monitoring stack is the thing that broke.
It is also Linux-specific. If you have AIX or Solaris in the environment, half the flags are different and systemctl does not exist. Read the man page. man ps is not optional reading.
Run the Audit, Fix What Fails
Print the eight checkpoints. Run them against your most critical server today. If you find three failures, that is normal. If you find zero, you missed something, so run them again with fresh eyes.
The whole point of this checklist is to convert tribal knowledge into a repeatable five-minute review. Ship it, run it weekly, and stop relearning ps aux at 3 AM. If you want us to run it across your fleet and write the runbooks, get in touch.


