Everything Open 2025 | Presentation: What happened in production?! Instrumenting with OpenTelemetry

Presented by

David Bell
@dtbell91@aus.social

David is a DevOps engineer who has long been interested in abandoning intuition and "gut feel" for solid data to better answer the question "are my production systems healthy?" (spoilers: they probably aren't) and helping teams answer the age-old "why does it do that in prod? it doesn't do that on my machine!"

Abstract

You could almost set your watch by it: at 2pm daily the microservice would time out and crash, the database growing increasingly slow and deadlock-prone, and the SLA perilously close to failing. Everything looked "normal" - the logs showed typical requests and responses right up until it all fell over, and the metrics showed the API received more requests at other times of the day, so it wasn't overwhelmed and had capacity. But something was different. Was it a noisy neighbour problem on the shared database? Something malicious not caught by the WAF? Solar flares? What was going on?!

Join us on a journey into the unknown-unknowns with our guide O11y (pronounced "Ollie", short for "Observability") as we explore:

- Observability and its "three pillars"
- OpenTelemetry Tracing
- Auto- and Manual-Instrumentation
- High Cardinality, High Dimensionality, and Sampling
- Honeycomb.io's querying and trace rendering