Designed and built a distributed system that detects unbilled usage across all AWS services — reducing charge discrepancies by 300x and eliminating 230 million monthly false positives.
Key Metrics #
300x Reduction in Discrepancies ($125,000 → $432)
230M False Positives Eliminated
~95% Alert Actionability
Architecture #
flowchart LR
A["Usage Records\n(Billions/day)"] --> B["Smart Sampling\n& Aggregation"]
B --> C["Multi-Signal\nValidation"]
C --> D["Automated\nResolution"]
D --> E{Real issue?}
E -- Yes --> F["Alert with\nDiagnosis"]
E -- No --> G["Auto-resolve\n& Log"]
style B fill:#6366f1,color:#fff
style C fill:#6366f1,color:#fff
style D fill:#6366f1,color:#fff
Technical Deep Dive #
Aggregation over Brute-Force #
Instead of checking every individual usage record (an approach that generated 230 million false positives per month), the system aggregates records by service, account, and billing period, then compares charges at that level.
- Built on DynamoDB for consistent low-latency reads at any scale
- Each record stores expected charge, actual charge, pricing plan, and discount metadata
- Reduced comparison space by orders of magnitude while preserving detection capability
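As a minimal sketch of the aggregation step, assuming usage records carry an expected and an actual charge, the grouping could look like the following. The type names (`UsageRecord`, `AggregateKey`, `ChargeAggregate`, `UsageAggregator`) and field names are hypothetical and chosen for illustration; the real schema and the DynamoDB access layer are not shown in the source.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical input shape: one raw usage record with its two charge values.
record UsageRecord(String service, String accountId, String period,
                   BigDecimal expectedCharge, BigDecimal actualCharge) {}

// One aggregate is kept per (service, account, period) combination.
record AggregateKey(String service, String accountId, String period) {}

// Running totals for one key; comparison happens on these, not on raw records.
record ChargeAggregate(BigDecimal expectedTotal, BigDecimal actualTotal) {
    ChargeAggregate merge(ChargeAggregate other) {
        return new ChargeAggregate(expectedTotal.add(other.expectedTotal),
                                   actualTotal.add(other.actualTotal));
    }

    // Positive when less was charged than expected (possible unbilled usage).
    BigDecimal discrepancy() {
        return expectedTotal.subtract(actualTotal);
    }
}

final class UsageAggregator {
    // Collapses many raw records into one aggregate per key.
    static Map<AggregateKey, ChargeAggregate> aggregate(List<UsageRecord> records) {
        return records.stream().collect(Collectors.toMap(
                r -> new AggregateKey(r.service(), r.accountId(), r.period()),
                r -> new ChargeAggregate(r.expectedCharge(), r.actualCharge()),
                ChargeAggregate::merge));
    }
}
```

In this sketch, small per-record differences that cancel out within a period never surface as individual mismatches, which is one plausible reason aggregation cuts down on noise.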
Beyond Simple Mismatch Detection #
A single charge mismatch does not by itself indicate a problem. The validation pipeline, built on AWS Lambda, checks multiple signals:
- Temporal correlation — Is this a timing issue that self-corrects?
- Pricing context — Did a pricing change or discount explain the difference?
- Historical pattern — Has this account shown similar patterns before?
- Magnitude thresholds — Is the discrepancy large enough to investigate?
Only discrepancies that none of these checks can explain are escalated.
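The escalation rule can be sketched as a list of checks, each of which may explain a discrepancy. This is an illustration under stated assumptions: the `Discrepancy` fields, the `Signal` and `Validator` names, and the `MIN_AMOUNT` threshold are all hypothetical, and each boolean stands in for whatever lookup the real pipeline performs.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical view of one aggregated discrepancy; each flag is assumed to be
// computed upstream (for example by a lookup against pricing or history data).
record Discrepancy(BigDecimal amount, boolean withinSettlementWindow,
                   boolean pricingChangeInPeriod, boolean matchesAccountHistory) {}

// A named check that returns true when it can explain the discrepancy.
record Signal(String name, Predicate<Discrepancy> explains) {}

final class Validator {
    // Illustrative threshold only; the real value is not given in the source.
    static final BigDecimal MIN_AMOUNT = new BigDecimal("1.00");

    static final List<Signal> SIGNALS = List.of(
            new Signal("temporal-correlation", Discrepancy::withinSettlementWindow),
            new Signal("pricing-context", Discrepancy::pricingChangeInPeriod),
            new Signal("historical-pattern", Discrepancy::matchesAccountHistory),
            new Signal("magnitude-threshold",
                    d -> d.amount().abs().compareTo(MIN_AMOUNT) < 0));

    // Names of every signal that accounts for the discrepancy.
    static List<String> explainedBy(Discrepancy d) {
        return SIGNALS.stream()
                .filter(s -> s.explains().test(d))
                .map(Signal::name)
                .toList();
    }

    // Escalate only when no signal offers an explanation.
    static boolean shouldEscalate(Discrepancy d) {
        return explainedBy(d).isEmpty();
    }
}
```

Keeping the signal names alongside the predicates means an auto-resolved discrepancy can be logged with the reason it was dismissed.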
From Symptom to Diagnosis #
Known discrepancy patterns are handled automatically:
- Re-processing dropped usage records
- Applying missing discounts retroactively
- Flagging records for manual review with specific context about what went wrong
Engineers receive alerts with a diagnosis, not just a symptom.
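One way to picture the resolution step is a dispatch from a recognized pattern to an action, with unrecognized cases routed to an engineer along with context. The sketch below is hypothetical throughout: the pattern labels, the `Finding` and `Outcome` shapes, and the action names are assumptions, since the source does not enumerate them.

```java
// Hypothetical pattern labels for an escalated discrepancy.
enum DiscrepancyPattern { DROPPED_USAGE, MISSING_DISCOUNT, UNKNOWN }

// What the validation stage hands over: the pattern plus supporting detail.
record Finding(DiscrepancyPattern pattern, String accountId, String detail) {}

// The chosen action, whether a human is alerted, and the diagnosis text.
record Outcome(String action, boolean alertEngineer, String diagnosis) {}

final class Remediator {
    static Outcome resolve(Finding f) {
        return switch (f.pattern()) {
            // Known causes are fixed automatically and only logged.
            case DROPPED_USAGE -> new Outcome("REPROCESS_USAGE", false,
                    "Usage records were dropped: " + f.detail());
            case MISSING_DISCOUNT -> new Outcome("APPLY_DISCOUNT_RETROACTIVELY", false,
                    "Discount was not applied: " + f.detail());
            // Anything else goes to an engineer with the context attached.
            case UNKNOWN -> new Outcome("MANUAL_REVIEW", true,
                    "No known cause for account " + f.accountId() + ": " + f.detail());
        };
    }
}
```

Because every outcome carries a diagnosis string, an alert produced this way states what went wrong as well as the fact that a charge did not match.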
Impact #
Tech Stack #
Java
DynamoDB
AWS Lambda
Distributed Systems
Billing Pipeline