80% faster, 70% less memory: building a new high-performance, low-cost Prometheus query engine
Room B | Mon 20 Jan 11:40 a.m.–12:25 p.m.
Presented by
-
Joshua Hesketh is a Senior Software Engineer at Grafana Labs. He primarily works on Mimir, an open source, horizontally scalable, highly available, multi-tenant TSDB that provides long-term storage for Prometheus.
-
Charles is a senior engineer on the Mimir team at Grafana Labs. He's worked with teams all around the globe and has a particular interest in developer experience, automation and cloud-native infrastructure.
When he’s not at work, you'll find him travelling, taking photos, eating chocolate and playing with Lego (usually not all at once).
Abstract
We’re building a brand-new Prometheus-compatible query engine for Grafana Mimir which runs up to 80% faster and with up to 70% lower peak memory usage. In this talk, we’ll share how we’ve achieved this, some of the Go performance lessons we’ve learnt, and how you can apply them to your own projects.
Mimir is an open source, horizontally scalable, highly available, multi-tenant TSDB that provides long-term storage for Prometheus.
Mimir is great at ingesting enormous amounts of time series data. But we think it can be even better at querying enormous amounts of time series data. So we’ve been working to improve Mimir’s query performance and resource consumption, with the goal of evaluating queries faster while also reducing CPU utilisation and peak memory consumption.
Our new query engine has been designed to deliver an improved user experience and vastly improved performance: our benchmarks show queries running up to 80% faster and with 70% lower peak memory consumption than Prometheus’ default engine, and our real-world testing shows similar results.
As we’ve been building the engine, we’ve learnt a number of Go performance lessons the hard way, including why using byte slices can sometimes be preferable to strings, the benefits and costs of memory pooling and the surprisingly large impact of function pointers. And we’ve seen the complexity (and bugs!) these things can introduce too, and developed a number of techniques to help combat this.
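To give a flavour of one of these lessons, here is a minimal sketch of memory pooling with Go's `sync.Pool`, reusing sample buffers between evaluations to reduce allocations and GC pressure. This is purely illustrative and not Mimir's actual pooling code; the names and sizes are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// floatSlicePool reuses float64 buffers between evaluations.
// Storing a *[]float64 (rather than the slice itself) avoids an
// extra allocation when the slice header is boxed into an interface.
var floatSlicePool = sync.Pool{
	New: func() any {
		s := make([]float64, 0, 1024) // capacity is an arbitrary illustrative choice
		return &s
	},
}

// getSlice returns a buffer with length reset to zero but capacity retained.
func getSlice() []float64 {
	return (*floatSlicePool.Get().(*[]float64))[:0]
}

// putSlice returns a buffer to the pool; the caller must not use it afterwards.
func putSlice(s []float64) {
	floatSlicePool.Put(&s)
}

func main() {
	s := getSlice()
	s = append(s, 1.5, 2.5) // no allocation: capacity was retained from the pool
	fmt.Println(len(s), cap(s) >= 1024)
	putSlice(s)
}
```

The trade-off the abstract alludes to is visible even here: the pool saves allocations, but a buffer accidentally used after `putSlice` is exactly the kind of hard-to-replicate bug that pooling can introduce.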
In this talk, you’ll:
- Get a peek inside the engine and some of the key design decisions that have enabled these results
- Learn some of the pros and cons of the new engine vs. Prometheus’ default PromQL engine and Thanos’ streaming PromQL engine
- See how Mimir’s other query optimisation techniques, such as streaming chunks and time splitting of queries, uniquely complement a PromQL engine that computes results over streams of series
- Learn some of the Go performance lessons we’ve learnt along the way: the things that worked, the things that didn’t, and the thing that later caused us a day of hunting down a hard-to-replicate bug
- Learn some of the techniques we’ve implemented to combat the issues some of these performance optimisations can introduce
- Learn how to apply these ideas to your own projects
- Hear what we plan to do next to improve the engine even further
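The "streams of series" idea mentioned above can be sketched as an iterator that yields one series at a time, so peak memory is bounded by a single series rather than the whole result set. This is a toy illustration of the concept, not Mimir's actual engine API; all names here are invented for the example.

```go
package main

import "fmt"

// SeriesIterator yields one series at a time. A consumer that works
// series-by-series never needs the full result set in memory at once.
type SeriesIterator interface {
	Next() bool
	At() []float64 // samples for the current series
}

// sliceIterator is a toy in-memory implementation for demonstration.
type sliceIterator struct {
	series [][]float64
	idx    int
}

func (it *sliceIterator) Next() bool    { it.idx++; return it.idx <= len(it.series) }
func (it *sliceIterator) At() []float64 { return it.series[it.idx-1] }

// sumPerSeries consumes the stream one series at a time, holding only
// the current series' samples (plus one float per finished series).
func sumPerSeries(it SeriesIterator) []float64 {
	var sums []float64
	for it.Next() {
		var s float64
		for _, v := range it.At() {
			s += v
		}
		sums = append(sums, s)
	}
	return sums
}

func main() {
	it := &sliceIterator{series: [][]float64{{1, 2, 3}, {4, 5}}}
	fmt.Println(sumPerSeries(it))
}
```

This shape is also why optimisations like streaming chunks compose well with such an engine: upstream components can feed series into the pipeline incrementally instead of materialising them up front.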