So you want to build an Incident Response stack using OpenTelemetry?
With a myriad of open source tools to choose from, how should users build an efficient incident response stack?
Let's start by talking about what constitutes Incident Response Management and its principles. On-call duties are important, but they are also stressful. We'll discuss how to manage them as a company grows and splits into teams across time zones.
From here we'll talk about:
1. What are good signals to collect from an application that will help debug and diagnose production issues?
In this section, we will discuss the properties of telemetry signals collected from applications: metrics are aggregatable, logs are point-in-time events, and traces are transactional. We will give an overview of OpenTelemetry and how the project defines an open data format for the observability community to innovate on. We will discuss the various components of the OpenTelemetry project and how they can help users instrument applications for the pillars of observability.
2. Next, what kind of query patterns should be allowed by a telemetry database and how can we use them efficiently?
In this section, we will see how collected telemetry data can be queried efficiently to generate insights.
3. How can I use these query patterns defined by telemetry databases to build an incident response stack?
Here we will bring together ideas from the first two sections to lay down a foundation for an incident response stack that includes alerts, SLOs and machine learning algorithms to detect when a system is not behaving normally. We will discuss how these systems can help users respond to incidents quickly and reach the root cause of issues.
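To make the idea of SLO-based alerting concrete, here is a minimal sketch of a burn-rate check, one common way to turn an SLO into an alert. It is an illustration only and not necessarily the approach the talk presents: the names Window, burn_rate and should_alert are hypothetical, and the request and error counts are assumed to come from aggregated metrics in whatever telemetry database is in use.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Aggregated request counts over one time window (hypothetical shape)."""
    total: int
    errors: int


def error_rate(window: Window) -> float:
    """Fraction of failed requests; an empty window counts as healthy."""
    return window.errors / window.total if window.total else 0.0


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent.

    A burn rate of 1.0 means errors arrive exactly at the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return error_rate(window) / error_budget


def should_alert(short: Window, long: Window,
                 slo_target: float, threshold: float) -> bool:
    """Alert only when both a short and a long window exceed the threshold.

    The long window filters out brief blips; the short window lets the alert
    clear soon after the problem stops.
    """
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)
```

For example, with a 99.9% SLO and an assumed threshold of 14.4, a long window of 10,000 requests with 200 errors and a short window of 1,000 requests with 30 errors would both exceed the threshold, so should_alert would return True. The threshold and window sizes are tuning choices that depend on the service.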