Monitoring with Telegraf and Prometheus
Monitoring the workload on a Mesos cluster is a challenge for which no industry standard exists. There are three distinct problems to solve, each specific to Mesos.
Firstly, the sheer number of possible metrics. Mesos enables and encourages microservice architectures. More services mean more metrics, which becomes a scaling problem.
Secondly, metric identity resolution. This means understanding the relationships between metrics from nodes, frameworks, executors, tasks and containers.
Lastly, cardinality. We must tag metrics with metadata for context, but each new combination of tag values increases the cardinality of the dataset.
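To make the cardinality point concrete, here is a minimal sketch. The tag names and values are invented for illustration and do not come from any real cluster. It shows that the worst-case number of distinct series for a single metric is the product of the number of values each tag can take.

```python
from itertools import product

# Hypothetical tag values for one metric; names and values are illustrative only.
tags = {
    "framework": ["marathon", "chronos"],
    "task": ["web", "worker", "cron"],
    "node": ["agent-1", "agent-2", "agent-3", "agent-4"],
}


def series_count(tag_values):
    """Worst-case number of distinct series: product of value counts per tag."""
    count = 1
    for values in tag_values.values():
        count *= len(values)
    return count


def enumerate_series(metric, tag_values):
    """List every (metric, tag set) pair that could exist for these tags."""
    keys = sorted(tag_values)
    return [
        (metric, tuple(zip(keys, combo)))
        for combo in product(*(tag_values[k] for k in keys))
    ]


# 2 frameworks * 3 tasks * 4 nodes = 24 possible series for a single metric.
baseline = series_count(tags)

# Adding one more tag with two values doubles the worst case.
with_container = dict(tags, container=["docker", "mesos"])
doubled = series_count(with_container)
```

In practice not every combination occurs, so this is an upper bound, but it shows why each added tag multiplies rather than adds to the size of the dataset.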
We will outline each problem, survey the state of metrics in orchestrated environments, and present a reference solution using Telegraf and Prometheus. All tools (and our own libraries) are free and open source, and will be available immediately for use with Mesos.