A Software Engineer’s Guide to Observability: Part 1 - Logging

Blueground Engineering’s observability guide to logging: why log, what to capture, JSON structure, correlation IDs, and AI-ready practices, with our logging policy and references.

Why Log - The role of logs in observability

In complex, distributed systems, failure is not a matter of if but when. When something fails, your ability to recover quickly depends on how fast you can understand exactly what happened - and that’s where logs have traditionally been our most valuable ally: Our primary forensics tool.

For logs to be truly useful, they should allow us to:

  • Reconstruct timelines: Logs should provide a chronological history of events, ideally across systems (see Correlating Logs), making it possible to trace the sequence of events leading up to a particular state or failure.
  • Get the full context: Logs should capture granular details about system state, user actions, and variables at the moment an event occurs. This context is invaluable for reconstructing what led to a system state or problem.
  • Identify root causes: Logs should include error types, codes, and stack traces, making it easier to pinpoint an issue down to the running node, thread, and offending line of code.
  • Validate system behavior: After deploying a change, checking the logs is an efficient way to validate that the system works as intended, providing detailed insight into its execution flow.

So, logs can be great for troubleshooting, but do they have other use cases?

While logs can serve purposes beyond troubleshooting, the modern observability stack, as we’ll cover in this series, offers better tools for most of them. Our recommendation is to treat logs as what they’re best at: a forensics tool for digging into problems once they’re found, not for spotting them in the first place. Any other use should be the exception, not the rule.

👉 Logs are your detective, not your lookout.

Who to log for?

Before we discuss what and where to log, it is essential to consider who we are logging for: Who’s reading our logs?

That may seem like a silly question: until recently, the only readers were engineers peering through elaborate user interfaces full of facets and complex search expressions. But as logs have become more structured and context-rich, they are now perfect candidates for feeding into large language models (LLMs). These AI systems can ingest and interpret massive volumes of log data, rapidly surfacing patterns, anomalies, and correlations that might take humans hours to uncover.

This evolution transforms logs from a passive record for manual investigation into an active, machine-readable signal source, enabling AI agents to assist with incident analysis, propose remediation steps, and continuously learn from operational history. 

We believe in a “vibe-debugging” future, where troubleshooting takes the form of a chat interface that neatly summarizes and verifies insights from our logs and other signals. The upside is enormous, and we should already be working towards it. But we also believe that AI crunching our logs is not an excuse for neglecting them. On the contrary, we think AI tools will make well-kept logs exponentially more valuable.

👉 Treat your logs with extra care - AI will only make them exponentially more valuable.

What to Log – Capturing the correct information

We can't log everything. We must choose what to log.

Before you start sprinkling log.info() or console.log() calls everywhere, ask yourself: “How can this log be useful?” and “Am I capturing all the information needed to make it useful?” Logging isn’t about dumping events. It’s about recording every event that could be of interest, discarding those that couldn’t, and doing so with enough context to provide maximum value. Ideally, every log entry should be intentional, carrying information that will help someone, human or AI, understand, reproduce, and act on what occurred.

Key application events

Make sure you always log the following runtime events:

  • Application initialization
  • Logging initialization
  • Incoming requests
  • Outgoing requests
  • “Use case” initiation/completion (e.g., payment_initiated / payment_completed) - see the sketch right after these lists
  • Application errors
  • Input and output validation failures
  • Authentication successes and failures
  • Authorization failures
  • Session management failures
  • Privilege elevation successes and failures
  • Other higher-risk events, like data import and export or bulk updates

And consider logging any other events that may help you with:

  • Troubleshooting
  • Testing in pre-production environments
  • Understanding user behavior
  • Security and auditing
  • Monitoring and performance (don’t get crazy with this though, APM will be your primary tool here)
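
To make the “use case” events concrete, here’s a minimal Kotlin sketch of an initiation/completion pair - the service, logger, and field names are illustrative, not prescribed by our policy:

import org.slf4j.LoggerFactory

class PaymentService {
    private val logger = LoggerFactory.getLogger(PaymentService::class.java)

    fun charge(bookingId: String, amountCents: Long) {
        // "Use case" initiation event, with just enough context to act on later
        logger.info("[ChargePayment] payment_initiated (bookingId=$bookingId, amountCents=$amountCents)")
        try {
            // ... call the payment gateway ...
            logger.info("[ChargePayment] payment_completed (bookingId=$bookingId)")
        } catch (e: Exception) {
            // Application error, logged where it is handled
            logger.error("[ChargePayment] payment_failed (bookingId=$bookingId)", e)
            throw e
        }
    }
}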

Data attributes 

For you to make the most out of those logged events, you’ll need to see them in context. 

Nemo dat quod non habet (no one gives what they do not have)

To deliver full context, logs must first capture it. Here’s what we consider the minimum context your logs should carry. We define it as a set of data attributes, and recommend using JSON as the standard log format for all environments, except for local development.

See our logging policy guide for the full expanded list of attributes we use at Blueground.

Core attributes
Fixed fields with special semantics (used for routing, faceting, and correlation). Usually auto‑populated by agents/ingestors.
Examples: host, log level, service, trace_id, message, date

Source code attributes
Understanding where a log originates in the code is key to understanding system behavior.
Examples: logger.name, logger.thread_name, logger.method_name

Network attributes
Low‑level network context for inbound/outbound connections - super helpful in troubleshooting, letting you filter down to specific log entries (e.g., those from a particular IP or with a large payload).
Examples: network.bytes_read, network.client.ip

Error attributes
Standardized error information to quickly classify and triage failures.
Examples:  error.kind, error.message, error.stack

HTTP attributes
HTTP request/response context for web traffic—great for filtering by endpoint, status, agent, etc.
Examples: http.url, http.method, http.status_code

Performance attributes
Measures that quantify how long something took—used for latency/SLO views and trace search defaults.
Examples: duration

User attributes
Who initiated the action—attach only when the flow is user‑driven (mind PII policies)
Examples: user.id, user.email, user.name

Domain attributes
Capturing common business context that’s shared across multiple services allows logs to be correlated consistently across the organization. By standardizing the naming of these attributes, teams can query and analyze data more effectively. For example, if your booking service, payment service, and invoicing service all reference the same booking, you can adopt a shared field name like domain.booking.id.

Service attributes
Service attributes capture fields that are specific to a particular service’s internal logic, without overlapping with the shared domain.* namespace. These attributes are namespaced under the service name to avoid collisions and to clearly indicate their origin.

For example, in a payment service called Payflow, an attribute like payflow.paymentGateway could store the gateway provider used for a transaction (stripe, adyen, other).

This ensures that service-specific details remain clearly scoped, while still allowing them to be combined with shared domain.* attributes for richer cross-service analysis.

Correlation ID
A unique identifier that facilitates log correlation across application layers and across distributed systems spanning multiple services.

Entrypoint
The entrypoint within the application that eventually led to this log entry, e.g., http/api, kafka/consumer, scheduled_job. Useful for root cause analysis.

So here’s what a full-blown log record looks like.

{
    "timestamp": "2025-09-05T14:26:03.702Z",
    "host": "hermes-569c4b8d6f-q2qxg",
    "service": "hermes",
    "message": "[res] POST /api/sms/send 200 (36ms)",
    "attributes": {
      "dd": {
        "service": "hermes",
        "env": "production",
        "version": "production-aa7a86f"
      },
      "duration": 36,
      "hostname": "hermes-569c4b8d6f-q2qxg",
      "level": 30,
      "logger": {
        "name": "koa-server"
      },
      "correlation_id": "Root=1-68baf2fb-57c323aa7a6b945911be71fb",
      "entrypoint": "http/api",  
      "http": {
        "headers": {
          "content-length": "141",
          "x-datadog-sampling-priority": "1",
          "x-correlation-id": "Root=1-68baf2fb-57c323aa7a6b945911be71fb",
          "accept": "application/json, application/cbor, application/*+json",
          "service-name": "guest-app",
          "host": "hermes-svc:3000",
          "content-type": "application/json",
          "connection": "keep-alive",
          "accept-encoding": "gzip, x-gzip, deflate",
          "user-agent": "Apache-HttpClient/5.2.3 (Java/17.0.12)"
        },
        "url_details": {
          "path": "/api/sms/send",
          "scheme": "http",
          "port": 3000,
          "host": "hermes-svc"
        },
        "status_code": 200,
        "method": "POST",
        "useragent": "Apache-HttpClient/5.2.3 (Java/17.0.12)",
        "version": "1.1",
        "url": "/api/sms/send"
      },
      "pid": 62,
      "status": "info",
      "network": {
        "bytes_written": 644,
        "client": {
          "internal_ip": "::ffff:172.31.226.32",
          "port": 44518,
          "ip": "::ffff:172.31.226.32"
        },
        "bytes_read": 656
      },
      "timestamp": 1757082363702
    },
    "tags": [
      "kube_service:hermes-svc",
      "kube_deployment:hermes",
      "display_container_name:hermes_hermes-569c4b8d6f-q2qxg",
      "service:hermes",
      "kube_replica_set:hermes-569c4b8d6f",
      "kube_namespace:production",
      "kube_container_name:hermes",
      "env:production",
      "source:nodejs"
      /* .. */
    ]
}

How to Log – Practices for useful, consistent logs

If you haven’t read the 12-factor app guidelines on logging, start there. It’s the right baseline: logs are streams, not files, and your platform should handle storage and aggregation. What follows builds on that foundation.

Key principles

  1. Structured — Use a format like JSON so both humans and machines can parse them.
  2. Consistent — Use the same field names and structure across all services.
  3. Context-rich — Include environment, service name, version, trace IDs, and user/session data.
  4. Semantic — Describe meaningful events (payment_failed, booking_created) rather than generic messages.
  5. Balanced between agnostic & domain-specific fields:
    • Domain-agnostic attributes: Common across all services (e.g., trace_id, env, service, request_id).
    • Domain-specific attributes: Specific to a service’s business logic (e.g., booking_id for a booking service, market for a regional pricing service).
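
To put principles 4 and 5 together, here’s a minimal Kotlin sketch of a semantic, context-rich event, assuming the logstash-logback-encoder library for the JSON fields (the service and field names are illustrative; domain-agnostic attributes like trace_id, env, and service are normally injected by the agent or MDC rather than passed by hand):

import net.logstash.logback.argument.StructuredArguments.kv
import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("booking-service")

fun onBookingCreated(bookingId: String, market: String) {
    // Semantic event in the message, structured context as JSON fields
    logger.info(
        "[CreateBooking] booking_created (bookingId=$bookingId)",
        kv("booking_id", bookingId), // domain-specific attribute
        kv("market", market)         // domain-specific attribute
    )
}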

Format: JSON default, Text in dev

In cloud-native, distributed systems, log format consistency is essential for effective machine parsing, indexing, and alerting. Use structured formats like JSON by default to ensure that logs can be easily consumed by observability tools, correlated across services, and filtered programmatically.

{
  "timestamp": "1714131374032",
  "logger": "foo.bar",
  "msg": "This is a log message",
  "host": "i-f823e12ac",
  "service": "blue",
  "source": "k8s/spring-boot",
  "status": "info",
}

That said, during local development, human-readable plain text can sometimes be more practical. Developers benefit from simple logs with clear formatting for quick debugging without needing a log viewer or parser. You can use a delimiter to separate the different parts of the log message.

[Fri Apr 26 2024 14:36:17] | INFO | This is a log message
  host=i-f823e12ac
  service=blue-brain
  source=k8s/spring-boot

The log message

We highly recommend using a generic message format for standard log events. We chose the following as our default:

[SCOPE] TEXT (key=value)*

SCOPE: Optional context in square brackets. Typically the name of the use case, command, or edge case being handled, e.g., [CreateBooking], [UpdateUser], [SendEmailJob]. The scope is redundant when it’s semantically equivalent to the logger.name.

TEXT: The human-readable description of the log event. See Writing style for details.

(key=value): Key/Value pairs go at the end and are enclosed in parentheses.
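
For example (values are illustrative):

[CreateBooking] Booking confirmed (bookingId=bk_18342, market=athens)
[SendEmailJob] Delivery scheduled (template=booking_confirmation, userId=84722)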

Logging HTTP requests

HTTP log events follow a particular message format that allows for very fast eye scanning among hundreds of events. Since applications typically both accept and issue HTTP requests, we employ slightly different formats for the two directions.

# Incoming Request
[req] METHOD PATH

# Outgoing Response
[res] METHOD PATH STATUS_CODE STATUS_TEXT (duration)

# Outgoing Request
-> [req] [host] METHOD PATH

# Incoming Response
<- [res] [host] METHOD PATH STATUS_CODE STATUS_TEXT (duration)

Should we log incoming requests or only responses?

Recommendation: No by default, but make it available under a feature flag.

  • Logging incoming requests == increased logging management costs.
  • Logging incoming requests == increased log verbosity
  • Logging incoming requests == ability to identify poisonous requests that may hang the server before it gets a chance to log a response. Without logging incoming requests, poisonous requests could go “stealth”.

Examples:

# INCOMING REQUEST
[req] GET /foo/bar?qux=baz
[req] POST /foo/bar

# OUTGOING RESPONSE
[res] GET /foo/bar?qux=baz 200 OK (0.17s)
[res] POST /foo/bar 201 CREATED (0.52s)

# OUTGOING REQUEST
-> [req] [foo-svc] GET /foo/bar?qux=baz
-> [req] [exchangerate.com] GET /api/rates

# INCOMING RESPONSE
<- [res] [foo-svc] GET /foo/bar?qux=baz 200 OK (0.2s)
<- [res] [exchangerate.com] GET /api/rates 200 OK (0.3s)
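
And a minimal Kotlin sketch of the flag-gated incoming-request logging recommended above, using an environment variable as the switch - the variable name and middleware shape are illustrative:

import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("http-server")

// Illustrative flag; in practice this could come from your feature-flag provider
private val logIncomingRequests: Boolean =
    System.getenv("LOG_INCOMING_REQUESTS")?.toBoolean() ?: false

fun handle(method: String, path: String, next: () -> Int) {
    if (logIncomingRequests) {
        // Logged up front, so "poisonous" requests can't go stealth
        logger.info("[req] $method $path")
    }
    val startMs = System.currentTimeMillis()
    val status = next()
    logger.info("[res] $method $path $status (${System.currentTimeMillis() - startMs}ms)")
}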

GraphQL carries minimal to no information in its URL path, so it requires special treatment when it comes to logging. Check our detailed Logging Policy for recommendations on logging GraphQL requests.

Logging Errors

A question that comes up very often is: "Where should we log an error?"

We recommend logging at the layer where you are handling the error, not before. Sometimes, however, we may need to transform an error and provide additional context to it before letting it bubble up the stack. For those cases, logging and rethrowing is just fine. Please see our Logging Policy for more information.
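
Here’s a minimal Kotlin sketch of the “add context, log once, rethrow” case - the client, exception type, and field names are illustrative:

import org.slf4j.LoggerFactory

class RateLookupException(message: String, cause: Throwable) : RuntimeException(message, cause)

// Illustrative abstraction over your HTTP client of choice
interface RatesApi { fun latestRate(base: String, quote: String): Double }

class ExchangeRateService(private val api: RatesApi) {
    private val logger = LoggerFactory.getLogger(ExchangeRateService::class.java)

    fun latestRate(base: String, quote: String): Double =
        try {
            api.latestRate(base, quote) // may throw a low-level client exception
        } catch (e: Exception) {
            // Add business context, log once, then rethrow; the layer that finally
            // handles the error should not log it a second time.
            logger.error("[GetExchangeRate] Rate lookup failed (base=$base, quote=$quote)", e)
            throw RateLookupException("Rate lookup failed for $base/$quote", e)
        }
}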

What errors should we log?

    • Unhandled exceptions
    • Unexpected Errors
      • Database Errors
      • Failed calls to external APIs or microservices.
      • Critical business operations that fail, such as order processing or payment transactions.
    • Resource Failures
      • Insufficient resources (e.g., memory, disk space).
      • Failures to read or write critical files.
      • Network connectivity issues affecting application performance.
    • HTTP 500s should also be logged as errors

Correlating Logs

When an incident hits, one of the fastest ways to cut through the noise is to be able to follow a single request, user action, or job execution across multiple services. That’s what correlation IDs give you: a common thread that ties together logs from different components in a distributed system.

Why it’s important.

Without correlation, troubleshooting becomes a tedious "search-and-guess" exercise. You might find the error log in Service A but have no easy way to locate the related events in Service B or the upstream API. With correlation IDs, you can instantly pull the full picture: every log line from every service that touched that request. 

Why not rely solely on APM tracing libraries?

Tracing tools like OpenTelemetry, Jaeger, or Datadog APM are powerful, but they’re not a substitute for log correlation. That’s important to understand. Traces can be incomplete due to sampling, dropped spans, or unsupported libraries. Logs are your always-on, high-fidelity source of truth, and correlation IDs ensure they can be stitched together even when tracing data is missing.

Implementation guidelines.

All applications should accept and propagate a correlation ID from all entry points to all exit points, storing it in the logging framework’s MDC (Mapped Diagnostic Context) or the equivalent mechanism in Node.js or Python.

We make use of HTTP, Kafka, and AMQP, so we make sure to do the following:

HTTP

  • Inbound requests: Set the MDC from the x-correlation-id HTTP header.
  • Outbound requests: Set the x-correlation-id HTTP header from the MDC.

Kafka

  • Consumers: Set the MDC from the x-correlation-id message header.
  • Producers: Set the message x-correlation-id header from the MDC.

RabbitMQ / AMQP

  • Consumers: Set the MDC from the correlation-id message property.
  • Producers: Set the message correlation-id property from the MDC.

Pro tip: Always generate a correlation ID if the caller doesn’t provide one. Use a consistent format (e.g., UUIDv4) and document it so other teams and services can interoperate cleanly.
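
Here’s a framework-agnostic Kotlin sketch of that propagation, using SLF4J’s MDC - the wrapper function and constants mirror the conventions above but are illustrative:

import org.slf4j.MDC
import java.util.UUID

const val CORRELATION_HEADER = "x-correlation-id"
const val CORRELATION_MDC_KEY = "correlation_id"

// Wrap every entry point (HTTP handler, Kafka consumer, scheduled job) with this.
fun <T> withCorrelationId(incomingId: String?, block: () -> T): T {
    // Reuse the caller's ID if present, otherwise generate one (UUIDv4)
    val correlationId = incomingId?.takeIf { it.isNotBlank() } ?: UUID.randomUUID().toString()
    MDC.put(CORRELATION_MDC_KEY, correlationId)
    try {
        return block()
    } finally {
        MDC.remove(CORRELATION_MDC_KEY) // don't leak the ID into the next unit of work on this thread
    }
}

// Outbound side: copy the MDC value into the outgoing header or message property.
fun outboundCorrelationHeaders(): Map<String, String> =
    MDC.get(CORRELATION_MDC_KEY)?.let { mapOf(CORRELATION_HEADER to it) } ?: emptyMap()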

Things to avoid

Unstructured logs
Hard to search, filter, or parse. AI tools can’t process them effectively. For production environments, always emit JSON with a consistent schema.

Inconsistent field names
Using different attributes like req_id, request_id, or requestId for capturing the same info across services can break cross-service searches and aggregations. Maintain a shared schema document and enforce it via CI or AI-assisted code reviews.
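
One lightweight way to enforce this in code is a tiny shared-constants module that every service depends on - a sketch, with names mirroring the attribute examples above:

// Shared library: single source of truth for cross-service field names.
// Services reference these constants instead of hand-typing "req_id" / "requestId".
object LogFields {
    const val CORRELATION_ID = "correlation_id"
    const val ENTRYPOINT = "entrypoint"
    const val USER_ID = "user.id"
    const val BOOKING_ID = "domain.booking.id"
}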

Excessive noise
When applications log too much, the signal gets buried under irrelevant chatter. Full payload dumps, repeated debug lines, or unfiltered objects quickly overwhelm log storage and make it harder for engineers to spot what actually matters during troubleshooting. This not only drives up ingestion costs but also slows down queries and dashboards, turning observability into a liability rather than an asset. Instead, logs should capture only the meaningful parts of a payload (e.g., IDs, statuses, counts) so that teams get the essential context without drowning in noise.
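
For example, instead of dumping a whole request object, log a summary of it - a sketch, with an illustrative data class:

import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("bulk-update")

data class BulkUpdateRequest(val bookingIds: List<String>, val targetStatus: String, val payload: Map<String, Any>)

fun logBulkUpdate(request: BulkUpdateRequest) {
    // Logging $request would dump the full payload; keep only IDs, statuses, and counts.
    logger.info("[BulkUpdate] Request received (count=${request.bookingIds.size}, targetStatus=${request.targetStatus})")
}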

Missing trace / correlation IDs
Makes distributed tracing across logs impossible. Always propagate trace_id and span_id via HTTP headers or Kafka/RabbitMQ message headers.

Logging sensitive data
Exposes sensitive information such as emails, user IDs, credit card numbers, or health data. This creates:

  • Compliance risks with regulations like GDPR, PCI-DSS, HIPAA.
  • Security risks if logs are leaked, improperly stored, or accessed by unauthorized parties.
  • Increased breach impact, as logs often bypass the same access controls as primary systems.

Mask sensitive data before emitting logs. Don’t fully redact unless necessary; partial masking preserves context for troubleshooting.

  • Keeps logs useful for debugging (e.g. identifying a specific user or transaction).
  • Example: j***@example.com still helps trace issues without exposing the full email.
  • Full redaction can make logs too opaque to be actionable.

Where to apply masking?

  • Normally, we suggest doing it in a centralized log aggregator, which provides consistent control across services. Most support regex- or AI-based PII detection and masking.
  • If cost or infrastructure is a concern, implement a shared masking utility/library and enforce its use across all applications in the organization.
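
Here’s a minimal sketch of such a shared masking helper, for emails - the masking rule is illustrative and not a substitute for your aggregator’s PII detection:

// Partial masking: keep the first character and the domain so the value stays
// useful for troubleshooting (j***@example.com) without exposing the full address.
fun maskEmail(email: String): String {
    val at = email.indexOf('@')
    if (at <= 0) return "***" // not a recognizable email, redact fully
    return "${email.first()}***${email.substring(at)}"
}

// Usage: logger.info("[SignUp] User registered (email=${maskEmail(user.email)})")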

High-cardinality fields
Can overwhelm your log platform’s indexing and slow searches. Keep unique identifiers in your logs, but only index the ones you actually query, such as correlation_id or order_id. For instance, while it’s useful to log a user_session_id for traceability, indexing it rarely adds value and only increases cost.

Not routing logs to stdout/stderr
In containerized and cloud-native applications, writing logs to local files means they won’t be picked up by the platform’s logging infrastructure (e.g., Kubernetes log collectors, sidecars, or cloud provider agents). This makes centralization, retention, and search much harder. Always write logs to stdout (and stderr for errors). Let the container orchestration platform handle log collection, shipping, and storage via its integrated logging pipeline.

Using application logs as audit logs
Operational logs aren’t designed for compliance — they lack immutability, strict schemas, and the retention and access controls required for audit data. Mixing them creates conflicts, risks data exposure, and leaves compliance gaps. Keep audit logs as a separate stream with its own schema, storage, and governance, independent of application logs.

Logging at the wrong layer
If you log too high up the stack, you may lose important technical details. If you log too deep, you might miss the business context needed to interpret the event. Misplaced logs can also lead to duplicate entries or conflicting information across layers. Log where you have the most relevant context for the event:

Bad:

fun validateAddress(address: Address): Boolean {
    if (address.street.isNullOrBlank()) {
        // We don't have enough context here since validateAddress 
        // may be used in a dozen places.
        logger.info("Street can't be blank")
        return false
    }

    if (address.zipCode.length < 5) {
        // We don't have enough context here since validateAddress 
        // may be used in a dozen places.
        logger.info("Zip code should be at least 5 chars")
        return false
    }

    return true
}

fun createProperty(property: PropertyDetails) {
    val isAddressValid = validateAddress(property.address)
    if (!isAddressValid) {
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val isAddressValid = validateAddress(profile.address)
    if (!isAddressValid) {
        throw InvalidAddressError()
    }
}

Better:

fun validateAddress(address: Address): Boolean {
    if (address.street.isNullOrBlank()) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Street can't be blank. (street=${address.street})")
        return false
    }

    if (address.zipCode.length < 5) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Zip code should be at least 5 chars. (zipCode=${address.zipCode})")
        return false
    }

    return true
}

fun createProperty(property: PropertyDetails) {
    val isAddressValid = validateAddress(property.address)
    if (!isAddressValid) {
        logger.info("[CreateProperty] Validation failed. Invalid address ({})", property.address)
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val isAddressValid = validateAddress(profile.address)
    if (!isAddressValid) {
        logger.info("[CreateUserProfile] Validation failed. Invalid address ({})", profile.address)
        throw InvalidAddressError()
    }
}

Best:

fun validateAddress(address: Address): AddressValidation {
    if (address.street.isNullOrBlank()) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Street can't be blank. (street=${address.street})")
        return AddressValidation.INVALID_STREET
    }

    if (address.zipCode.length < 5) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Zip code should be at least 5 chars. (zipCode=${address.zipCode})")
        return AddressValidation.INVALID_ZIP
    }

    return AddressValidation.OK
}

fun createProperty(property: PropertyDetails) {
    val addressValidation = validateAddress(property.address)
    if (addressValidation != AddressValidation.OK) {
        logger.info("[CreateProperty] Address validation failed (type={})", addressValidation)
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val addressValidation = validateAddress(profile.address)
    if (addressValidation != AddressValidation.OK) {
        logger.info("[CreateUserProfile] Address validation failed (type={})", addressValidation)
        throw InvalidAddressError()
    }
}

Implementing with AI – Using AI to set up and maintain logging

Modern AI coding assistants aren’t just for generating snippets — they can also act as policy enforcers. To make this possible, we provide them with a single source of truth: a link to our full logging policy. With that reference in place, tools like Cursor, Claude, and Copilot can review existing code for compliance. To operationalize this, we established a lightweight file structure that directly points to the logging policy URL and defines commands for applying it.

your-repo/
├── logging-policy/
│   └── logging-policy.yaml        # pure pointer to your Outline URL
├── .cursor/
│   └── commands.apply-logging.md  # `/apply-logging` (Outline-driven)
└── .claude/
    └── commands.apply-logging.md  # `/apply-logging` (Outline-driven)

Logging policy file

version: 1
policy_outline: "https://theblueground.getoutline.com/s/cbaa9fd2-0a7d-4bde-bfcd-a5b54a5ea98b"

Claude Commands file

# /apply-logging
Context: The Outline URL in ./logging-policy/logging-policy.yaml
Action: Refactor the selected code (or current file) to comply with the Outline. Output a unified diff + short rationale.

Cursor Commands file

# /apply-logging
Task: Read the Outline at the URL inside ./logging-policy/logging-policy.yaml and refactor the current selection/file to comply.
Return: A unified diff (patch) plus a brief rationale referencing the Outline sections used.

Analyzing with AI – Leveraging AI to gain insights from logs

Once logs are structured, consistent, and rich with context, they become not just searchable, but also machine-understandable. This is where AI can significantly enhance operational visibility and speed up incident resolution. Modern SRE AI agents can:

  • Reconstruct incident timelines
    Automatically group and sequence log entries by trace_id, correlation_id, or other linking fields to produce a coherent narrative of what happened across multiple services, regions, or queues.
  • Detect unusual sequences or behaviors
    Identify patterns that deviate from normal operational baselines, for example, multiple payment_failed events without retries, or a spike in booking_cancelled events immediately after a deployment.
  • Correlate logs with metrics, traces, and topology
    Link log anomalies with metric changes (CPU spikes, memory drops, increased queue lag) or distributed traces, then overlay them on a service dependency graph to narrow down probable failure points.
  • Classify and prioritize incidents
    Assign severity levels based on the type of event, affected domains, or historical impact, ensuring critical issues are escalated first.
  • Propose investigative steps and remediation actions
    Suggest relevant dashboards, queries, or log filters to explore next, and in some cases even trigger pre-defined remediation workflows such as restarting a service, rolling back a deployment, or scaling a resource.
  • Learn from past incidents
    Continuously improve detection accuracy by ingesting postmortems, past incident data, and human feedback on AI-generated suggestions.

When coupled with a robust logging schema and trace propagation, AI agents can reduce the time from detection to root cause from hours to minutes, turning logs from a passive record into an active, intelligent signal source for both human engineers and automated recovery systems.

Key takeaways

  • Logs are most effective when they provide the right context — enough to understand what happened without drowning in noise.
  • The primary role of logs is troubleshooting and forensics, not real-time alerting.
  • Standardization is invaluable: consistent structure and practices make logs usable across teams, systems, and tools.
  • We are entering an era where the primary consumer of logs will be AI, not humans, so clarity and structure are more critical than ever.
  • Logs alone don’t give the full picture of how a request flows across services — in the next post, we’ll explore how tracing and APM fill that gap.

In the next post, we’ll dive into APM and tracing, the other key forensics tool in the observability kit. Unlike logs, tracing shines a light on the user’s experience, capturing response times, errors, page load speeds, and more.

Resources