A Software Engineer’s Guide to Observability

A practical observability guide for Engineering Managers & SREs: understand when to use logs, tracing, and metrics, with policies, examples, and AI-driven SRE.

As with 42, the famed Answer to the Ultimate Question from Deep Thought, many Observability “how-to” guides give you answers that make you wonder if you even knew the question in the first place.

Who is it for?

This blog post series is for Engineering Managers and SREs who view observability as something the entire team should be able to rely on, rather than a side project that each person approaches in their own way.

It’s also for those who’ve slogged through OpenTelemetry, ELK, or Datadog guides and ended up with working dashboards… but still aren’t sure when to reach for which tool, or why. Here, we take a step back to see the bigger picture before diving into the details.

While we’ll focus primarily on why and when to use each signal, we’ll also link to detailed policies and reference implementations from our Observability Framework for anyone who’d like to take them as-is or tailor them to their own needs and preferences.

Why write it?

At Blueground, we run a global network of fully furnished, designer apartments through our own proprietary Property Management System. We’re not an observability vendor, but we are pretty obsessed with observability, and we’re fortunate to have a Platform team that treats it like a first-class discipline.

At one point, we decided to step back and rethink how we do observability across all our teams and systems. That’s when we spotted a gap: most guides are tool-first and vendor-driven. They focus on knobs, configs, and integrations rather than helping you figure out what’s actually worth measuring, and what those signals mean once you have them.

So we decided to write the guide we wish we’d had: a high-level map of what each observability component can do for you, and what’s actually worth capturing or measuring - before diving head-first into OpenTelemetry for JVM, Python, or whatever tech stack you call home.

Why observability matters more than ever

Step back for a moment and look at the modern software landscape: everything is becoming increasingly complex and expensive to monitor.

We believe that two forces in particular are driving this shift: the spread of distributed systems and the rise of AI-generated code.

Distributed everything
Microservices are now the default. They’re not perfect, but they’re here to stay. On top of that, modern systems tend to scale horizontally across all of their layers, from the load balancer to the datastore. That means the tool stack is getting bigger, compute and data are spread across many different places, and the overall system is harder to reason about.

The AI code-gen era changes the rules
Agentic AI is now writing and shipping code at a pace we can’t expect to review line by line. That means more features land faster, but also more unsupervised code sneaks into production. If it hasn’t landed at your place yet, it’s certainly at the doorstep.

As a result, engineering teams are changing shape: broader skill sets, but lighter on deep domain expertise. That means fewer people who can just “read the stacktrace” and instantly know the cause. Observability now has to close that gap by making systems explain themselves clearly.

To put it simply, AI code-gen together with Agentic AI means:

  • More features, shipped faster.
  • Code that fewer humans read before production.
  • Less specialist knowledge in each area.

In short, the system architecture is getting more complex, the tool stack is bigger, the data feeds are growing, and the unknowns are growing with them. Suddenly, engineering teams are spending a disproportionate amount of time resolving issues rather than building. This harms both productivity and developer experience: these are people hired to build the next generation of software, not to moonlight as troubleshooters. That’s why observability isn’t a “nice to have” anymore; it’s survival gear.

At Blueground, we’ve had first-hand experience. We gradually moved from a monolith to a 20+ microservices architecture. On top of that came new product verticals (including a marketplace), acquisitions, rotations, team restructurings, senior team members moving on, and systems being transferred from one team to another. And then Agentic coding arrived: Copilot, Cursor, Claude Code, opencode.ai, Bolt, Lovable, and the list goes on. That is what led us to rethink our telemetry from scratch.

Observability Goals

Observability is the ability to understand a system’s behavior through the data it generates. Done well, observability gives you:

  • Faster problem-solving - Less time spent hunting, more time fixing.
  • Fewer major incidents - Spot issues before they get loud.
  • Better performance - Use fewer resources without breaking things.
  • More focus time - Fewer context switches for the team.

It’s as much about engineering quality of life (Developer Experience) as it is about uptime.

Observability Components

What does Observability look like in practice? 

Think of it as a set of complementary ways to understand what’s going on inside your systems, each answering different kinds of questions.

You’ve probably seen the “three pillars” before:

  • Logging: Written records of what happened. Used primarily for troubleshooting an already-identified problem, not for finding out you have one to begin with.
  • Tracing (APM): Where a request goes and what it does along the way. Used for both identifying and troubleshooting performance issues at the request/execution-thread level.
  • Metrics & SLOs: Numbers that tell you how the system is behaving. These numbers give the big-picture view of how the system is performing and whether it’s meeting expectations. They’re what you check first to see if something’s wrong - while logs and traces are for figuring out exactly why.
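To make the distinction concrete, here’s a minimal Python sketch showing all three signals side by side, using the OpenTelemetry API plus the standard logging module. The checkout/order names and attributes are hypothetical, and without an SDK and exporter configured the OpenTelemetry calls are no-ops; treat it as an illustration of what each signal captures, not a reference implementation.

```python
import logging

from opentelemetry import metrics, trace

# Logging: a written record of what happened, for troubleshooting later.
logger = logging.getLogger("checkout")

# Tracing: follow a single request as it moves through the system.
tracer = trace.get_tracer("checkout")

# Metrics: numbers that describe how the system behaves over time.
meter = metrics.get_meter("checkout")
orders_processed = meter.create_counter(
    "orders.processed", description="Orders processed by the checkout flow"
)


def process_order(order_id: str) -> None:
    # One span per unit of work; child spans would show where the time goes.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)  # hypothetical attribute

        # ... business logic would live here ...

        orders_processed.add(1, {"outcome": "success"})
        logger.info("order processed", extra={"order_id": order_id})
```

The point isn’t the library; it’s that each signal answers a different kind of question, which is why the rest of the series treats them one at a time.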

We’ll also talk about:

  • Dashboards & Monitors: Where you make sense of the above.
  • AI SRE: Where AI starts doing some of the heavy lifting.

So, here’s how this blog post series is laid out:

  • Intro (this one)
  • Part 1 - Logging: Capturing events in a way that’s useful later for troubleshooting.
  • Part 2 - APM & Tracing: Trace requests across systems to troubleshoot performance (and collect performance metrics).
  • Part 3 - Metrics & SLOs: Describe your system in numbers and use them to spot issues.
  • Part 4 - Observing your data through Dashboards & Monitors: Turning raw data into something you can actually act on.
  • Part 5 - AI SRE: How AI can help with Observability and On-Call.

Next up is Logging.

As a bonus, we’ll also share our full Logging Policy, complete with the nitty-gritty: data attributes, log message formats, and best practices. You can use it straight out of the box or tweak it to fit your stack and log aggregation tool of choice.

Stay tuned, and let’s start with the logs.