Abhishek Srivastava is a Software Engineer at Expedia Group. He is also one of the core contributors to Haystack, Expedia's open-source distributed tracing system, which ingests millions of trace logs per day. He is passionate about everything around the Web, be it designing or running systems at scale. He is always up for coffee to talk about the web, psychology or games.
We at Expedia work on a mission of connecting people to places through the power of technology. To accomplish this, we build and run hundreds of micro-services that each provide a different piece of functionality to serve every single customer request, generating billions of events in the process. Now, what happens when one or more services fail at the same time? To improve observability in our system, we need to connect these failure points across our distributed topology and so reduce mean time to detect (MTTD) and mean time to know (MTTK).
In this talk, we will present the journey of distributed tracing at Expedia, which started with Zipkin as a prototype and ended with us building our own open-source solution. We will do a deep dive into our architecture and demonstrate how we ingest terabytes of tracing data (around 8 TB/day) in production, with a peak throughput of over 550,000 spans/second, for hundreds of micro-services.
Beyond our primary use case of distributed tracing, we use this data to trend service errors, latencies and request rates, to perform anomaly detection on the aggregated trends, and to build service-dependency and network-latency graphs.
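As a rough illustration of what trending spans can look like, the sketch below groups spans into per-service, one-minute buckets and computes a request count, an error count and a mean latency for each bucket. It is a minimal sketch, not Haystack's actual implementation: the `Span` fields, the `aggregate_trends` function and the 60-second bucket size are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record (field names are assumptions).
    service: str
    start_epoch_s: int   # span start time, in seconds since the epoch
    duration_ms: float   # span latency in milliseconds
    error: bool          # whether the span was marked as failed


def aggregate_trends(spans, bucket_s=60):
    """Group spans by (service, bucket start) and compute simple trends."""
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        # Round the start time down to the beginning of its bucket.
        bucket_start = span.start_epoch_s - span.start_epoch_s % bucket_s
        bucket = buckets[(span.service, bucket_start)]
        bucket["count"] += 1
        bucket["errors"] += int(span.error)
        bucket["total_ms"] += span.duration_ms
    # Turn the running totals into one trend point per bucket.
    return {
        key: {
            "count": b["count"],
            "errors": b["errors"],
            "mean_latency_ms": b["total_ms"] / b["count"],
        }
        for key, b in buckets.items()
    }


example_spans = [
    Span("search", 0, 10.0, False),
    Span("search", 59, 30.0, True),
    Span("search", 60, 5.0, False),
]
trends = aggregate_trends(example_spans)
```

Each (service, bucket) pair becomes one point in a time series, and it is series of this kind that an anomaly detector would then consume. At production volumes this aggregation would run over a stream rather than an in-memory list.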
As this volume grew, we felt the need for a real-time, intelligent alerting and monitoring system to move towards 24/7 reliability. We will talk about how we use neural networks on trends to perform anomaly detection, including a deep dive into the architecture of the automated training pipeline and of online compute using streams in a cost-effective manner.