Erin Willingham
Krux
VP of Technical Operations
Erin Willingham is an Infrastructure Engineer for the Krux Platform, which serves billions of requests per day for many of the web's top properties, including the NY Times, HBO, Roku and Spotify. Erin has over 10 years of experience wearing many hats in internet operations, with a focus on DevOps and security. His recent projects have included scaling Graphite beyond the limits known to man, automating security and vulnerability scanning, and launching a money transfer platform into production.
Mo’ metrics, mo’ problems; when a million metrics per second isn’t enough anymore

Krux is an infrastructure provider for many of the websites and devices you use online today, like NYTimes.com, WSJ.com, Roku and NBCU. For every request you make on those properties, Krux gets one or more as well. We grew from zero traffic to several billion requests per day from all over the world in the span of a few years, and we did so exclusively in AWS. Two years ago we shared our architecture for economically capturing a million unique metrics every second across our entire infrastructure, in a presentation titled “How to measure everything”. Since then Krux has grown tremendously, and a mere million metrics per second is no longer enough to capture all the behavioral and performance data from our infrastructure, so we had to find the next frontier of metrics collection.

This is the story of the challenges and pitfalls we encountered along the way. I will share the details of how we designed and operate our global metrics infrastructure to capture several million unique metrics every second from virtually every part of the system, including inside the web server, user apps, AWS and the OS itself, for around $10k/month and with minimal effort from the development team. This infrastructure is built entirely on top of open source software, including code that Krux has released, which is also currently used by companies like the BBC, the New York Times, Groupon and Lyft to run their infrastructure. The content will be applicable to anyone who collects, or wants to collect, vast amounts of metrics in a cloud or datacenter setting and make sense of them.