AI & Productivity

Case Study: Fleet-Scale Kernel Automation at Twitter

At Twitter, I was responsible for kernel updates across 5,000+ production servers. Updating a kernel is risky on one machine. Doing it across a fleet, without downtime, without data loss, and without breaking the services that millions of people depend on, is a different problem entirely.

The Problem

Twitter’s production infrastructure ran on thousands of bare-metal servers across multiple data centers. Each server ran a Linux kernel that needed regular updates for security patches, performance improvements, and hardware compatibility.

Case Study: Building AWS Billing's Unbilled Usage Auditor

I spent five years on the AWS Billing team. The hardest problem I tackled was detecting when customers used AWS services but weren’t charged correctly. This post walks through how I designed a system that reduced charge discrepancies by 300x and eliminated 230 million monthly false positives.

The Problem

AWS billing is trickier than it looks. When a customer launches an EC2 instance, writes to S3, or queries DynamoDB, each action generates a usage record. These records flow through a pipeline that calculates charges based on the customer’s pricing plan, region, and service tier.