Beyond Sequential: A Recipe for Async Pipeline Observability and Alerting
How many of us have heard "microservices this, microservices that"? No matter how cool and useful the new tech stack is, at the end of the day we are responsible for its reliability, and it's up to us to ensure proper observability. Our company runs thousands of microservices, and while implementing observability for synchronous HTTP requests was relatively straightforward (we leveraged Prometheus metrics and largely followed Google's SRE book for implementing SLOs, or Service Level Objectives), about 10% of our applications remained uncovered because they are asynchronous in nature. Even though interactions within these pipelines don't always happen in real time, they play an equally important role in customer satisfaction.
In this talk, we would like to present how our company implemented SLOs for asynchronous pipelines from start to finish: how to identify the key metrics that represent customer experience, how to add them using Prometheus, how to define good and valid events, the formulas we used for burn rate alert thresholds, and, at the end, some successful case studies.
We are planning to briefly go over the types of async pipelines we have at our company and the challenges that come with them. Then we will dive into which components a good observability solution needs, and from there go over the definitions of SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement) and how they correlate with customer experience. Next, we will cover Prometheus metric types: how a counter can be used to count errors, how histograms can track latency distributions, and how to define the Good/Valid events that power SLOs using these custom metrics. Since SLOs need an alerting system, we will expand on burn rate alert thresholds and the formulas we defined for our async use case, in contrast to the sync use cases covered in Google's SRE handbook. If we have enough time (based on further dry-runs and presentation prep), we could show some examples from the internal platform we developed for product teams to add SLOs for their applications, and how the Good/Valid event queries, recording rules, and alerting rules are created for them automatically. Finally, we'll wrap up with one or more successful case studies (for example, an incident where latency from event producer to event consumer grew from 58.2 seconds to 9.8 minutes, leaving 21K events waiting in the backlog to be processed), along with best practices and lessons learned.
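To give a flavor of the instrumentation part of the talk, here is a minimal sketch of how a counter and a histogram could be added to an async consumer with the Python prometheus_client library. The metric names, labels, event fields, and the process_event helper are hypothetical placeholders, not our production code.

```python
import time
from prometheus_client import Counter, Histogram

# Hypothetical metrics for an async pipeline consumer.
EVENTS_TOTAL = Counter(
    "pipeline_events_processed_total",
    "Total events consumed from the async pipeline",
    ["pipeline", "outcome"],  # outcome: "success" or "error"
)
EVENT_LATENCY = Histogram(
    "pipeline_event_latency_seconds",
    "End-to-end latency from event production to successful processing",
    ["pipeline"],
    buckets=[1, 5, 15, 60, 300, 900, 3600],  # seconds; tuned per pipeline
)

def handle(event, pipeline="orders"):
    """Consume one event, recording the outcome and producer-to-consumer latency."""
    try:
        process_event(event)  # hypothetical business logic
    except Exception:
        EVENTS_TOTAL.labels(pipeline=pipeline, outcome="error").inc()
        raise
    EVENTS_TOTAL.labels(pipeline=pipeline, outcome="success").inc()
    # Producer-to-consumer latency, measured against a timestamp the producer
    # attached to the event payload.
    EVENT_LATENCY.labels(pipeline=pipeline).observe(time.time() - event["produced_at"])
```

With metrics like these, a Valid event can be defined as any processed event and a Good event as one that succeeded within an agreed latency bucket, which is the shape of the Good/Valid queries the talk walks through.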
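For the alerting portion, the baseline burn-rate relationship from Google's SRE Workbook can be sketched as below; the talk derives how we adapted these thresholds for async pipelines, so the numbers here are the standard multiwindow example, not our actual formulas.

```python
def burn_rate(bad_events: float, valid_events: float, slo_target: float) -> float:
    """Error-budget burn rate over an observation window.

    A burn rate of 1 means the error budget would be exactly exhausted by the
    end of the SLO period; higher values consume it proportionally faster.
    """
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    error_ratio = bad_events / valid_events  # observed over the alert window
    return error_ratio / error_budget

# Standard example threshold: page when a 1-hour window burns 2% of a
# 30-day error budget, i.e. burn rate >= 14.4.
SLO_PERIOD_HOURS = 30 * 24
BUDGET_FRACTION = 0.02
WINDOW_HOURS = 1
threshold = BUDGET_FRACTION * SLO_PERIOD_HOURS / WINDOW_HOURS  # 14.4

if burn_rate(bad_events=36, valid_events=2000, slo_target=0.999) >= threshold:
    print("page the on-call: the error budget is burning too fast")
```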