Meshing Your Way to Better Observability
We have significantly improved the detection and mitigation time of mesh/network-related incidents by isolating and measuring latency across individual mesh components. This proposal describes how we did it.
Measuring infrastructure latency has been a long-standing challenge because latencies are hard to isolate. Our detection and restore times for mesh-related incidents suffered because we had no per-component breakdown and no isolated measurements covering every request. In short, setting SLIs/SLOs on infrastructure latency, backed by accurate metrics aggregated across all mesh components, gives us the ability to detect and mitigate infrastructure incidents efficiently.
Given the challenges we faced, we concluded that sampled metrics (i.e., Istio/client metrics) are not sufficient for this task, because their resolution is too coarse to reveal minor changes. In addition, a change in client metrics does not necessarily indicate a network issue.
The solution is based on injecting headers at each mesh component along the request path. Envoy sidecars measure latency on the request flow and inject their measurements as headers. By subtracting consecutive measurement values, we can isolate the latency of each component and construct fitted histograms (there will be a diagram). The resulting accuracy contributes to earlier incident detection.
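The subtraction step can be illustrated with a minimal sketch. The header format below (ordered `component=microseconds` entries) and the component names are assumptions made for illustration only, not the format used in production.

```python
def parse_measurements(header_value):
    """Parse 'name=micros,name=micros' into an ordered list of (name, micros).

    The header layout is hypothetical; it only illustrates the idea.
    """
    points = []
    for entry in header_value.split(","):
        name, _, raw = entry.strip().partition("=")
        points.append((name, int(raw)))
    return points


def segment_latencies(points):
    """Subtract consecutive measurements to isolate each segment's latency."""
    return {
        f"{a}->{b}": t_b - t_a
        for (a, t_a), (b, t_b) in zip(points, points[1:])
    }


# Example value as it might be read from a response header (made-up numbers).
header = "client-sidecar=1000,ingress-gateway=1450,server-sidecar=1700"
segments = segment_latencies(parse_measurements(header))
print(segments)
```

Each key names one segment of the path, so a regression in a single component shows up in its own series instead of being averaged into the end-to-end latency.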
We have implemented dedicated dependencies for our main framework components, which cover our workloads extensively. When the client-side host receives a response, we extract the header measurements, process them, and expose accurate latency histograms and trace spans.
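As a rough sketch of what exposing a latency histogram involves, the class below counts observations into fixed buckets. The class name and bucket boundaries are hypothetical, and a real deployment would more likely rely on an existing metrics library than on hand-rolled code.

```python
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram; each bucket counts values <= its upper bound."""

    def __init__(self, upper_bounds_ms):
        self.upper_bounds = sorted(upper_bounds_ms)
        # One extra slot at the end collects values above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0
        self.sum_ms = 0.0

    def observe(self, value_ms):
        # bisect_left returns the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.upper_bounds, value_ms)
        self.counts[idx] += 1
        self.total += 1
        self.sum_ms += value_ms


# Hypothetical bucket bounds and measurements, in milliseconds.
hist = LatencyHistogram([1, 5, 10])
for value in [0.5, 5, 7, 20]:
    hist.observe(value)
print(hist.counts, hist.total)
```

Recording every request, rather than a sample, is what lets small shifts between buckets become visible.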
Using OpenTelemetry tracing, we insert custom spans into actual application traces. Measurements are shown as separate spans, which lets us distinguish application latency from mesh/network latency in the trace and points to the components involved.
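To show how measurements could become separate spans, the sketch below turns per-segment durations into span records with explicit start and end times. It deliberately uses plain dictionaries instead of a tracing SDK. The function name, the `mesh:` prefix, and the sample values are hypothetical, and the back-to-back layout is a simplifying assumption.

```python
def build_mesh_spans(request_start_ns, segments_ns, parent_name):
    """Lay segments out back to back, starting at the request start time."""
    spans = []
    cursor = request_start_ns
    for name, duration_ns in segments_ns:
        spans.append({
            "name": f"mesh:{name}",
            "parent": parent_name,  # the application span these attach to
            "start_ns": cursor,
            "end_ns": cursor + duration_ns,
        })
        cursor += duration_ns
    return spans


# Made-up durations for three mesh segments of a single request.
spans = build_mesh_spans(
    request_start_ns=1_000_000,
    segments_ns=[
        ("client-sidecar", 200_000),
        ("network", 450_000),
        ("server-sidecar", 150_000),
    ],
    parent_name="GET /orders",
)
for span in spans:
    print(span["name"], span["end_ns"] - span["start_ns"])
```

Records like these carry everything a tracing backend needs to draw mesh segments next to the application's own spans: a name, a parent, and explicit timestamps.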
Using the data produced and collected, we have built end-to-end visibility into the latency breakdown of each of our mesh/network components. Constructing SLIs/SLOs for all mesh/network components enables us to use anomaly detection methods to spot problems, alert automatically, and triage incidents efficiently (with proven examples). This includes showing latency from the service's perspective, which shortens the time it takes to detect and mitigate network-related issues.
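A latency SLI of the kind described here can be sketched as the fraction of requests whose mesh latency stays under a threshold, compared against an SLO target. The threshold, target, and sample window below are illustrative assumptions, not our production values.

```python
def latency_sli(latencies_ms, threshold_ms):
    """Fraction of requests whose latency is at or below the threshold."""
    if not latencies_ms:
        return 1.0  # no traffic in the window: treat as compliant
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def slo_breached(sli, target):
    """True when the measured SLI falls below the SLO target."""
    return sli < target


# Hypothetical window of per-request latencies for one mesh component.
window_ms = [2.1, 1.8, 2.4, 9.7, 2.0, 2.2, 1.9, 2.3, 2.0, 12.5]
sli = latency_sli(window_ms, threshold_ms=5.0)
print(sli, slo_breached(sli, target=0.99))
```

Because the SLI is computed per component, an alert on a breach already indicates where in the mesh to start triage.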