Exemplars - A Tale of Pillars Supporting The Observability Structure
Exemplars are point-in-time observations made at the time of metric generation. These observations usually carry things like a trace ID and span ID, which can be used to navigate from a time series graph to a trace waterfall. The Prometheus and OpenTelemetry SDKs natively support the generation of exemplars, provided the application is already instrumented with distributed tracing. At eBay, we have ensured that all applications emit HTTP API related golden signals as metrics, along with traces for each API. Given this, we are able to emit Exemplars for every host's APIs based on status code, path and verb. These Exemplars are stored in Prometheus' circular ring buffer and are only available for a brief period of time, depending on the ring buffer's memory.
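To make the retention behavior concrete, here is a minimal sketch of a fixed-capacity buffer that evicts its oldest entries. It is not Prometheus' actual implementation; the `Exemplar` record shape, the function name and the capacity and rate figures are all assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    # Hypothetical record shape used only for this illustration.
    timestamp: float  # seconds since the start of the simulation
    trace_id: str
    span_id: str


def retained_window_seconds(capacity: int, exemplars_per_second: int,
                            duration_seconds: int) -> float:
    """Fill a fixed-capacity buffer at a steady rate and return the
    time span (newest minus oldest timestamp) it still covers."""
    # deque(maxlen=...) silently drops the oldest entry once it is full,
    # which is the eviction behavior of a circular buffer.
    buffer = deque(maxlen=capacity)
    total = exemplars_per_second * duration_seconds
    for i in range(total):
        ts = i / exemplars_per_second
        buffer.append(Exemplar(ts, f"trace-{i}", f"span-{i}"))
    if not buffer:
        return 0.0
    return buffer[-1].timestamp - buffer[0].timestamp


if __name__ == "__main__":
    # Same capacity, two ingestion rates (assumed numbers).
    print(retained_window_seconds(1000, 10, 600))
    print(retained_window_seconds(1000, 100, 600))
```

With the capacity held constant, the window of history that survives shrinks as the ingestion rate grows, which is why a busy system sees exemplars disappear quickly.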
For more than a year, we used the OpenTelemetry Collector's tail sampling processor. Over time, we realized that our traffic patterns are not conducive to true tail sampling, and that doing it on top of our trace store (ClickHouse) is the only way to achieve it. An aha moment led us to the idea of using Exemplars as the source of truth for what should and should not be sampled, given that Exemplars provide unique traces for every kind of status code, API path and host. Retaining those traces could give us a very good representative sample set. Given this assumption, we built a tail sampler that ingests all Exemplars into ClickHouse and runs sampling jobs that sample based on the trace IDs in the Exemplars.
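The selection idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the sampler described in the talk: the real jobs run against ClickHouse, while here in-memory lists stand in for the tables. The `ExemplarRow` and `Span` shapes, the function names and the `per_key` cap are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class ExemplarRow:
    # Hypothetical row shape: the dimensions named in the text plus a trace ID.
    host: str
    path: str
    verb: str
    status_code: int
    trace_id: str


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    name: str


def trace_ids_to_keep(exemplars: Iterable[ExemplarRow],
                      per_key: int = 1) -> Set[str]:
    """Collect up to `per_key` distinct trace IDs for every
    (host, path, verb, status_code) combination seen in the exemplars.
    The cap is an assumption; without it every exemplar trace is kept."""
    kept: Dict[Tuple[str, str, str, int], List[str]] = {}
    for ex in exemplars:
        key = (ex.host, ex.path, ex.verb, ex.status_code)
        bucket = kept.setdefault(key, [])
        if len(bucket) < per_key and ex.trace_id not in bucket:
            bucket.append(ex.trace_id)
    return {tid for bucket in kept.values() for tid in bucket}


def sample_spans(spans: Iterable[Span], keep: Set[str]) -> List[Span]:
    """Retain only spans that belong to a selected trace."""
    return [s for s in spans if s.trace_id in keep]


if __name__ == "__main__":
    exemplars = [
        ExemplarRow("host-1", "/checkout", "POST", 200, "t1"),
        ExemplarRow("host-1", "/checkout", "POST", 200, "t2"),
        ExemplarRow("host-1", "/checkout", "POST", 500, "t3"),
    ]
    spans = [Span("t1", "s1", "POST /checkout"), Span("t2", "s2", "POST /checkout")]
    print(sample_spans(spans, trace_ids_to_keep(exemplars)))
```

Because the key includes the status code, rare outcomes such as a 500 on a single host still contribute a retained trace, which is what makes the resulting sample representative.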
Finally, the in-memory circular ring buffer where Exemplars are stored has very poor retention, which depends on how many Exemplars come in. In certain cases, we have seen that it can hold only 15 minutes' worth of Exemplars, which is not as useful as one would want it to be. Yet another aha moment was realizing that we could write an Exemplar query API to serve Exemplars from ClickHouse instead of Prometheus and retain Exemplars for the life of the trace.
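A rough sketch of what such a query function might look like follows. The response envelope is modeled on the JSON shape returned by Prometheus' `/api/v1/query_exemplars` endpoint so that existing clients could keep working. Everything else is an assumption: a list of dictionaries stands in for the ClickHouse table, and the row fields, label names and function name are hypothetical.

```python
from typing import Any, Dict, List


def query_exemplars(rows: List[Dict[str, Any]], matchers: Dict[str, str],
                    start: float, end: float) -> Dict[str, Any]:
    """Return exemplars whose series labels equal every matcher and whose
    timestamp falls within [start, end], grouped by series."""
    grouped: Dict[tuple, List[Dict[str, Any]]] = {}
    for row in rows:
        if not (start <= row["timestamp"] <= end):
            continue
        labels = row["series_labels"]
        # Equality matchers only; a real API would parse a PromQL selector.
        if any(labels.get(name) != value for name, value in matchers.items()):
            continue
        series_key = tuple(sorted(labels.items()))
        grouped.setdefault(series_key, []).append({
            "labels": {"trace_id": row["trace_id"], "span_id": row["span_id"]},
            "value": str(row["value"]),
            "timestamp": row["timestamp"],
        })
    return {
        "status": "success",
        "data": [
            {"seriesLabels": dict(series_key),
             "exemplars": sorted(exs, key=lambda e: e["timestamp"])}
            for series_key, exs in grouped.items()
        ],
    }


if __name__ == "__main__":
    store = [  # stand-in for rows persisted in the trace store
        {"series_labels": {"path": "/checkout", "status": "500"},
         "trace_id": "t1", "span_id": "s1", "value": 0.25, "timestamp": 100.0},
    ]
    print(query_exemplars(store, {"path": "/checkout"}, 0.0, 200.0))
```

Since the rows live in the trace store rather than in memory, the time range a caller can ask for is bounded by trace retention rather than by buffer capacity.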
What ended up happening is that the metrics platform and the trace platform have built a symbiotic relationship and coexist to support the broader Observability charter. The binding force between them is Exemplars.
In this presentation, we discuss how we built our metrics and trace platforms, the novel Exemplar-based tail sampler, long-term Exemplar queries served out of ClickHouse, and how we are re-imagining the ways in which metrics and traces can coexist. By the end of the presentation, the audience will come away with simple ideas that they can implement themselves, and will see how they can benefit from adopting Exemplars in their own ecosystem.