Breandan Dezendorf
Senior Operations Engineer
Breandan Dezendorf has been working in UNIX and Linux operations for over 15 years. His specialties include monitoring, alerting, trending, and log aggregation at scale. He is comfortable with many flavors of UNIX and Linux, but has spent his entire professional life using a Macintosh. He has spent considerable time working for academic institutions and traveling internationally, and enjoys helping others solve intricate technical problems.
Herding ELK: Scaling ELK Towards A Trillion Documents On Disk

A practical walkthrough of scaling ELK (Elasticsearch/Logstash/Kibana) at Fitbit Inc, from 7 days of retention at 45,000 documents per second to 30 days of retention at 335,000 documents per second and growing.

When this project started, the production ELK cluster at Fitbit was handling a peak daily load of about 45,000 logs per second and storing them for 7 days. There was a desire to retain logs for 4 weeks, both to give developers and operations more time to evaluate the impact of changes to code and underlying systems, and to support security audits of system logging information. At the same time, logging volume had increased dramatically due to changes in logging, architectural changes to internal services, and changes in user behavior. The current cluster peaks at 335,000 logs per second, which at 30 days of retention is over 800 billion documents on disk.

During the scaling process, various bottlenecks were encountered. This talk walks through many of them and how we overcame them, and offers insight into the metrics and system statistics we keep an eye on. The transition involved upgrading or replacing almost every facet of the logging pipeline: Redis was dropped in favor of Apache Kafka, the data nodes were moved to a hot/warm architecture to help manage the cost side of the cluster, and an archiving tier was established to preserve logs in heavily compressed and encrypted form for long-term security and failure analysis. Scripts were also developed to make index management, metrics gathering, and operational emergency work easier and more reliable.

* Changes in Elasticsearch 2.x, notably the requirement that no field name contain a "." character and the strict field-type mappings, have made the transition even more complicated, requiring computationally expensive plugins to filter and transform data as it arrives.
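
  The kind of transformation required can be sketched as follows. This is an illustrative example, not Fitbit's actual plugin: it recursively renames any key containing "." before a document is indexed, similar in spirit to Logstash's de_dot filter.

  ```python
  def de_dot(event, replacement="_"):
      """Recursively rename keys containing '.' so documents are
      accepted by Elasticsearch 2.x mappings (illustrative sketch)."""
      if isinstance(event, dict):
          return {k.replace(".", replacement): de_dot(v, replacement)
                  for k, v in event.items()}
      if isinstance(event, list):
          return [de_dot(v, replacement) for v in event]
      return event

  # Hypothetical log document with dotted field names
  doc = {"http.status": 200, "nested": {"user.id": "abc"}}
  print(de_dot(doc))  # {'http_status': 200, 'nested': {'user_id': 'abc'}}
  ```

  Walking every key of every document at ingest time is why such filtering is computationally expensive at hundreds of thousands of events per second.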

* Kafka replication over WAN links can be extraordinarily painful if you aren't watching TCP window sizes.
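
  The underlying issue is the bandwidth-delay product: the TCP window must cover bandwidth times round-trip time or the link can never run full. A quick back-of-the-envelope check, with hypothetical link numbers:

  ```python
  def required_tcp_window_bytes(bandwidth_bits_per_sec, rtt_seconds):
      """Bandwidth-delay product: the minimum TCP window needed to
      keep a WAN link full. Numbers below are hypothetical."""
      return int(bandwidth_bits_per_sec * rtt_seconds / 8)

  # Example: a 1 Gbit/s link with an 80 ms cross-country RTT
  window = required_tcp_window_bytes(1_000_000_000, 0.080)
  print(window)  # 10000000 bytes -- far larger than a default 64 KB window
  ```

  If the kernel's maximum socket buffers cap the window well below this figure, Kafka inter-datacenter replication throughput collapses regardless of available bandwidth.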

* Balancing the number of Kafka consumers against the number of topic partitions is important for throughput, but keeping message distribution even across partitions and consumers is a challenge.
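
  A simplified sketch of why the ratio matters: when the partition count is not a multiple of the consumer count, a range- or round-robin-style assignment (as Kafka's default assignors perform) leaves some consumers with more partitions than others, and the most-loaded consumer caps effective throughput.

  ```python
  def assign_partitions(num_partitions, num_consumers):
      """Round-robin partition assignment, simplified sketch of
      Kafka consumer-group balancing (not the actual assignor)."""
      assignment = {c: [] for c in range(num_consumers)}
      for p in range(num_partitions):
          assignment[p % num_consumers].append(p)
      return assignment

  # 10 partitions across 4 consumers: two consumers get 3 partitions,
  # two get only 2 -- an uneven load.
  print({c: len(ps) for c, ps in assign_partitions(10, 4).items()})
  # {0: 3, 1: 3, 2: 2, 3: 2}
  ```

  Keeping partition counts an even multiple of expected consumer counts avoids this skew, but only if producers also spread messages evenly across partitions.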

* Kibana 4 can take advantage of doc_values; Kibana 3 (only usable with Elasticsearch 1.x) cannot. doc_values trade increased on-disk storage for reduced heap use during queries.
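
  For reference, doc_values are enabled per field in the index mapping. A minimal sketch of an Elasticsearch 1.x-style mapping fragment, with a hypothetical field name:

  ```python
  import json

  # Illustrative mapping fragment: a not_analyzed string field with
  # doc_values enabled, so aggregations read column data from disk
  # instead of building fielddata on the JVM heap.
  mapping = {
      "properties": {
          "hostname": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": True,
          }
      }
  }
  print(json.dumps(mapping, indent=2))
  ```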

* Unrestricted access to Kibana is difficult to manage. Ingesting the Elasticsearch slow query log and correlating it with the webserver's authentication logs can lead you to users who need help formatting their queries correctly.
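
  A minimal sketch of that correlation, assuming both logs have already been parsed into (timestamp, detail) tuples; the entries and time window here are hypothetical, and a real pipeline would also join on source IP:

  ```python
  from datetime import datetime, timedelta

  # Hypothetical parsed log entries
  slow_queries = [(datetime(2016, 4, 1, 12, 0, 5), "leading-wildcard query")]
  auth_log = [(datetime(2016, 4, 1, 12, 0, 4), "alice"),
              (datetime(2016, 4, 1, 11, 0, 0), "bob")]

  def correlate(slow, auth, window=timedelta(seconds=5)):
      """Attribute each slow query to users who authenticated within
      a small time window of it (simplified sketch)."""
      matches = []
      for q_time, query in slow:
          for a_time, user in auth:
              if abs(q_time - a_time) <= window:
                  matches.append((user, query))
      return matches

  print(correlate(slow_queries, auth_log))
  # [('alice', 'leading-wildcard query')]
  ```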