Case Study: Fleet-Scale Kernel Automation at Twitter

Nick Liu
Author
Building infrastructure for Facebook Feed Ranking at Meta. Previously at Walmart, Twitter, AWS, and eBay. MS in Computer Science at Georgia Tech.
At Twitter, I was responsible for kernel updates across 5,000+ production servers. Updating a kernel is risky on one machine. Doing it across a fleet, without downtime, without data loss, and without breaking the services that millions of people depend on, is a different problem entirely.

The Problem

Twitter’s production infrastructure ran on thousands of bare-metal servers across multiple data centers. Each server ran a Linux kernel that needed regular updates for security patches, performance improvements, and hardware compatibility.

The challenge wasn’t updating one kernel. It was updating thousands, safely:

  • Heterogeneous fleet: Different hardware generations, different workloads, different kernel configurations. A kernel that works perfectly on one host type might crash on another.
  • Zero tolerance for downtime: These servers ran core Twitter services. A bad kernel update could take down a shard of the user timeline, DM delivery, or ad serving.
  • Manual process: Before my work, kernel updates were largely manual. Engineers would update hosts in small batches, watch for issues, and roll back if something went wrong. At 5,000+ hosts, this didn’t scale.
  • Validation gap: There was no systematic way to validate that a new kernel version was safe for a given host type before rolling it out.

The Approach

I built three interlocking systems to solve this: canary validation, automated rollout, and configuration standardization. The resulting pipeline:

flowchart LR
    A["Canary\nValidation"] --> B["Wave 1\n1%"]
    B --> C["Wave 2\n5%"]
    C --> D["Wave 3\n25%"]
    D --> E["Full Fleet"]
    B -- anomaly --> F["Pause &\nAlert"]
    C -- anomaly --> F
    D -- anomaly --> F

1. Canary Kernel Validation Library

Before any kernel could be rolled out fleet-wide, it had to pass canary validation. I built a Python library that:

  • Provisioned canary hosts: Selected representative hosts from each hardware/workload combination in the fleet
  • Applied the kernel update: Installed the new kernel and rebooted the canary hosts
  • Ran validation suites: Checked system stability, performance benchmarks, and application-level health checks
  • Compared baselines: Measured the canary against production baselines — CPU utilization, memory pressure, I/O latency, network throughput

Only after a kernel passed canary validation on every host type would it be approved for fleet-wide rollout.
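
The selection and approval steps above can be sketched roughly as follows. This is a minimal illustration, not the library itself: the `Host` fields, `pick_canaries`, and `approved_for_rollout` are hypothetical names, and it assumes each host can be labeled with one hardware type and one workload.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    hardware: str  # hypothetical label for the hardware generation
    workload: str  # hypothetical label for the service class on the host


def pick_canaries(fleet, per_type=1):
    """Pick up to `per_type` hosts for every (hardware, workload) combination."""
    groups = defaultdict(list)
    for host in fleet:
        groups[(host.hardware, host.workload)].append(host)
    # Sort by name so the selection is deterministic and easy to audit.
    return {
        host_type: sorted(hosts, key=lambda h: h.name)[:per_type]
        for host_type, hosts in groups.items()
    }


def approved_for_rollout(canaries, results):
    """Approve only if every canary of every host type passed validation.

    `results` maps host name -> bool; a missing result counts as a failure.
    """
    return all(
        results.get(host.name, False)
        for hosts in canaries.values()
        for host in hosts
    )


fleet = [
    Host("web-001", "gen3", "timeline"),
    Host("web-002", "gen3", "timeline"),
    Host("cache-001", "gen4", "cache"),
]
canaries = pick_canaries(fleet)
results = {"web-001": True, "cache-001": False}
print(approved_for_rollout(canaries, results))  # False: the gen4/cache canary failed
```

Treating a missing result as a failure keeps the gate conservative: a host type that was never validated blocks the rollout just like one that failed.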

Canary validation isn’t just “does the kernel boot?” It’s “does the kernel behave identically to the current one under production-like load?” A kernel can boot fine and still introduce a 5% latency regression that cascades into user-visible impact.
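
One way to express that kind of check is a relative comparison against the baseline. The sketch below is an assumption-laden example: the metric names, the 5% tolerance, and the `regressions` function are illustrative, and it assumes every baseline value is nonzero.

```python
def regressions(baseline, canary, tolerance=0.05,
                higher_is_better=("network_throughput",)):
    """Return metrics where the canary is worse than baseline by more than `tolerance`.

    Values are relative changes, e.g. 0.1 means 10% worse than baseline.
    Assumes every baseline value is nonzero.
    """
    failed = {}
    for metric, base in baseline.items():
        change = (canary[metric] - base) / base
        # For throughput-style metrics a drop is the regression, so flip the sign.
        worse_by = -change if metric in higher_is_better else change
        if worse_by > tolerance:
            failed[metric] = round(worse_by, 4)
    return failed


baseline = {"cpu_utilization": 0.40, "io_latency_ms": 2.0, "network_throughput": 900.0}
canary = {"cpu_utilization": 0.41, "io_latency_ms": 2.2, "network_throughput": 890.0}
print(regressions(baseline, canary))  # {'io_latency_ms': 0.1}
```

Here the canary boots and serves traffic, but its I/O latency is 10% above baseline, so it would fail validation under the assumed 5% tolerance.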

2. Automated Rollout System

Once a kernel was validated, the rollout system handled deployment in progressive waves:

  • Wave 1: 1% of the fleet (a handful of hosts per data center)
  • Wave 2: 5% — expanding to more host types
  • Wave 3: 25%, covering a broad cross-section of the fleet
  • Wave 4: Remaining hosts
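
As a rough sketch, the wave sizes can be derived from a list of percentages. The example assumes the percentages are cumulative targets (5% means 5% of the fleet upgraded once wave 2 finishes), which the list above leaves open; `plan_waves` and the host names are hypothetical. A real planner would also spread each wave across data centers and host types.

```python
WAVE_TARGETS_PCT = [1, 5, 25, 100]  # assumed cumulative share of the fleet after each wave


def plan_waves(hosts, targets_pct=WAVE_TARGETS_PCT):
    """Split `hosts` into consecutive waves that reach each cumulative target."""
    waves, done = [], 0
    for pct in targets_pct:
        # Integer ceiling of len(hosts) * pct / 100, avoiding float rounding.
        upto = (len(hosts) * pct + 99) // 100
        waves.append(hosts[done:upto])
        done = upto
    return waves


hosts = [f"host-{i:04d}" for i in range(5000)]
print([len(wave) for wave in plan_waves(hosts)])  # [50, 200, 1000, 3750]
```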

Between each wave, the system monitored for anomalies: unexpected reboots, performance regression, application errors. If any signal crossed a threshold, the rollout paused automatically and alerted the on-call engineer with context about what went wrong and which hosts were affected.
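
The pause-and-alert behavior can be sketched as a loop that stops at the first unhealthy wave. Everything here is illustrative: `run_rollout` and its three callbacks are hypothetical stand-ins for real deploy, monitoring, and paging systems.

```python
def run_rollout(waves, apply_kernel, find_anomalies, alert):
    """Upgrade wave by wave; stop at the first wave that shows anomalies."""
    for number, wave in enumerate(waves, start=1):
        for host in wave:
            apply_kernel(host)
        # Expected shape: {host: reason} for every host over a threshold.
        anomalies = find_anomalies(wave)
        if anomalies:
            # Send the on-call engineer the wave and the affected hosts.
            alert({"wave": number, "affected": anomalies})
            return "paused"
    return "complete"


# Usage with stub callbacks in place of real systems.
upgraded, alerts = [], []
status = run_rollout(
    waves=[["a1"], ["b1", "b2"], ["c1"]],
    apply_kernel=upgraded.append,
    find_anomalies=lambda wave: {"b2": "unexpected reboot"} if "b2" in wave else {},
    alert=alerts.append,
)
print(status)    # paused
print(upgraded)  # ['a1', 'b1', 'b2']
```

Because the loop returns before touching the third wave, host `c1` never receives the kernel that caused the reboot in wave 2.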

3. Fleet Configuration Standardization

I discovered that a big chunk of fleet management pain came from configuration drift. Hosts had been manually tweaked over years and no longer matched their expected state.

I built tooling to:

  • Audit configurations: Scan every host and compare its actual state to the declared state
  • Detect drift: Identify hosts that had diverged from their intended configuration
  • Remediate automatically: For safe divergences, apply corrections. For risky ones, flag for human review.
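
A minimal sketch of the audit, detect, and remediate steps, assuming host configuration can be flattened into key/value pairs. The setting names and the `RISKY_KEYS` set are hypothetical examples, not the actual rules used.

```python
RISKY_KEYS = {"kernel.cmdline", "bootloader"}  # hypothetical: changes here go to a human


def detect_drift(declared, actual):
    """Return {key: (declared_value, actual_value)} for every setting that differs."""
    keys = declared.keys() | actual.keys()
    return {
        key: (declared.get(key), actual.get(key))
        for key in sorted(keys)
        if declared.get(key) != actual.get(key)
    }


def triage(drift, risky_keys=RISKY_KEYS):
    """Split drift into settings safe to auto-correct and settings needing review."""
    auto_fix = {k: v for k, v in drift.items() if k not in risky_keys}
    needs_review = {k: v for k, v in drift.items() if k in risky_keys}
    return auto_fix, needs_review


declared = {"vm.swappiness": "10", "kernel.cmdline": "quiet", "ntp.server": "ntp.internal"}
actual = {"vm.swappiness": "60", "kernel.cmdline": "quiet nosmt", "ntp.server": "ntp.internal"}
auto_fix, needs_review = triage(detect_drift(declared, actual))
print(auto_fix)      # {'vm.swappiness': ('10', '60')}
print(needs_review)  # {'kernel.cmdline': ('quiet', 'quiet nosmt')}
```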

This wasn’t strictly a kernel problem, but it was a prerequisite. You can’t safely automate kernel updates on hosts whose configuration you don’t fully understand.

4. Cache Service Custom Commands

I also designed a custom commands system for Twitter’s Redis-based cache services using Go. This let operators inspect and modify cache behavior at runtime without restarting services, which was critical for debugging production issues without customer impact.

The Results

| Metric | Before | After |
| --- | --- | --- |
| Kernel update method | Manual, batch-by-batch | Automated, progressive rollout |
| Time to update fleet | Weeks | Days |
| Hosts with validated kernels | Partial | 5,000+ (full fleet) |
| Configuration drift detection | None | Continuous |
| On-call ticket volume (peak) | Unmanageable | 140+ resolved in one week |

The “140+ tickets in one on-call week” stat deserves context. This wasn’t a normal week; the fleet had accumulated significant technical debt. The tooling I’d built let me systematically triage and resolve issues that would have previously required investigating each host individually.

What I Learned

Automation Without Validation Is Dangerous

The temptation with fleet automation is to focus on speed: how fast can we push updates to every host? But speed without validation just means failing faster. The canary validation system was the most important piece, not the rollout automation.

Configuration Drift Is the Silent Killer

The hardest bugs to debug in fleet management aren’t kernel bugs. They’re “why does this host behave differently from every other host of the same type?” The answer is almost always configuration drift that accumulated over months or years. Investing in configuration auditing paid for itself many times over.

On-Call Is a Design Problem

Resolving 140+ tickets in a week wasn’t about working harder. It was about having the right tools. When your tooling gives you enough context to diagnose and resolve issues in minutes instead of hours, you can handle 10x the volume. The best on-call experience is one where the tools do the investigation and the human makes the decision.

Fleet management at scale comes down to trust: trusting that your hosts are in the state you think they are, trusting that an update won’t break things, and trusting that if something does go wrong, you’ll know immediately and can recover automatically. Every system I built was about establishing and maintaining that trust.

This experience shaped my approach to infrastructure: instrument everything, validate before acting, and design systems that explain themselves when they fail.

For more about my career journey and other projects, see my experience page or projects page.
