The Problem #
Twitter’s production infrastructure ran on thousands of bare-metal servers across multiple data centers. Each server ran a Linux kernel that needed regular updates for security patches, performance improvements, and hardware compatibility.
The challenge wasn’t updating one kernel. It was updating thousands, safely:
- Heterogeneous fleet: Different hardware generations, different workloads, different kernel configurations. A kernel that works perfectly on one host type might crash on another.
- Zero tolerance for downtime: These servers ran core Twitter services. A bad kernel update could take down a shard of the user timeline, DM delivery, or ad serving.
- Manual process: Before my work, kernel updates were largely manual. Engineers would update hosts in small batches, watch for issues, and roll back if something went wrong. At 5,000+ hosts, this didn’t scale.
- Validation gap: There was no systematic way to validate that a new kernel version was safe for a given host type before rolling it out.
The Approach #
I built three interlocking systems to solve this:
flowchart LR
A["Canary\nValidation"] --> B["Wave 1\n1%"]
B --> C["Wave 2\n5%"]
C --> D["Wave 3\n25%"]
D --> E["Full Fleet"]
B -- anomaly --> F["Pause &\nAlert"]
C -- anomaly --> F
D -- anomaly --> F
1. Canary Kernel Validation Library #
Before any kernel could be rolled out fleet-wide, it had to pass canary validation. I built a Python library that:
- Provisioned canary hosts: Selected representative hosts from each hardware/workload combination in the fleet
- Applied the kernel update: Installed the new kernel and rebooted the canary hosts
- Ran validation suites: Checked system stability, performance benchmarks, and application-level health checks
- Compared baselines: Measured the canary against production baselines — CPU utilization, memory pressure, I/O latency, network throughput
Only after a kernel passed canary validation on every host type would it be approved for fleet-wide rollout.
2. Automated Rollout System #
Once a kernel was validated, the rollout system handled deployment in progressive waves:
- Wave 1: 1% of the fleet (a handful of hosts per data center)
- Wave 2: 5% — expanding to more host types
- Wave 3: 25% — majority coverage
- Wave 4: Remaining hosts
Between each wave, the system monitored for anomalies: unexpected reboots, performance regression, application errors. If any signal crossed a threshold, the rollout paused automatically and alerted the on-call engineer with context about what went wrong and which hosts were affected.
3. Fleet Configuration Standardization #
I discovered that a big chunk of fleet management pain came from configuration drift. Hosts had been manually tweaked over years and no longer matched their expected state.
I built tooling to:
- Audit configurations: Scan every host and compare its actual state to the declared state
- Detect drift: Identify hosts that had diverged from their intended configuration
- Remediate automatically: For safe divergences, apply corrections. For risky ones, flag for human review.
This wasn’t strictly a kernel problem, but it was a prerequisite. You can’t safely automate kernel updates on hosts whose configuration you don’t fully understand.
4. Cache Service Custom Commands #
I also designed a custom commands system for Twitter’s Redis -based cache services using Go . This let operators inspect and modify cache behavior at runtime without restarting services, which was critical for debugging production issues without customer impact.
The Results #
| Metric | Before | After |
|---|---|---|
| Kernel update method | Manual, batch-by-batch | Automated, progressive rollout |
| Time to update fleet | Weeks | Days |
| Hosts with validated kernels | Partial | 5,000+ (full fleet) |
| Configuration drift detection | None | Continuous |
| On-call ticket volume (peak) | Unmanageable | 140+ resolved in one week |
The “140+ tickets in one on-call week” stat deserves context. This wasn’t a normal week; the fleet had accumulated significant technical debt. The tooling I’d built let me systematically triage and resolve issues that would have previously required investigating each host individually.
What I Learned #
Automation Without Validation Is Dangerous #
The temptation with fleet automation is to focus on speed — how fast can we push updates to every host? But speed without validation means speed at failing. The canary validation system was the most important piece, not the rollout automation.
Configuration Drift Is the Silent Killer #
The hardest bugs to debug in fleet management aren’t kernel bugs. They’re “why does this host behave differently from every other host of the same type?” The answer is almost always configuration drift that accumulated over months or years. Investing in configuration auditing paid for itself many times over.
On-Call Is a Design Problem #
Resolving 140+ tickets in a week wasn’t about working harder. It was about having the right tools. When your tooling gives you enough context to diagnose and resolve issues in minutes instead of hours, you can handle 10x the volume. The best on-call experience is one where the tools do the investigation and the human makes the decision.
This experience shaped my approach to infrastructure: instrument everything, validate before acting, and design systems that explain themselves when they fail.
For more about my career journey and other projects, see my experience page or projects page.