Designing Logs That Actually Help When Production Is on Fire


Most teams do not design logs.
They react to incidents.

Someone adds a console.log, fixes the bug, and moves on. Over time, production logs become a noisy graveyard of half-sentences, inconsistent formats, missing context, and emotionally-charged messages like:

“wtf is this even happening”

Then one day, production goes down at 2:47 AM, and suddenly those logs are all you have.

This article is about designing logs intentionally, so that any engineer, on-call or not, can understand what happened without opening the source code.


Logs Are Not Debug Statements

This is the first mindset shift.

Logs are operational tools, not developer notes.
They exist to help engineers:

  • Debug incidents
  • Investigate user issues
  • Trace failures across systems
  • Measure system health

If a log message only makes sense to the person who wrote the code, it is not a useful log.


1. Use Structured Logging (Not Free-Text Messages)

Plain-text logs are easy for humans to read but hard for machines to query.

Structured logs (usually JSON) are queryable, filterable, and aggregatable.

Bad

"Payment failed for user 4821"

Good

{
  "level": "error",
  "event": "payment_failed",
  "user_id": "u_4821",
  "order_id": "ord_99234",
  "error_code": "CARD_DECLINED",
  "request_id": "req_9f2a",
  "duration_ms": 1834
}

Now your observability stack can:

  • Alert on event = payment_failed
  • Count failure rates by error_code
  • Track latency trends using duration_ms
  • Correlate errors by request_id

Design your logs like database rows, not diary entries.
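
As a rough illustration of the "Good" example above, here is a minimal sketch using Python's standard logging module. The JsonFormatter class, the build_logger helper, and the convention of passing fields through extra={"fields": ...} are assumptions made for this example, not a standard API.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname.lower(), "event": record.getMessage()}
        # Anything passed as extra={"fields": {...}} is attached to the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Captured in memory here so the resulting line can be inspected.
stream = io.StringIO()
logger = build_logger("payments", stream)
logger.error(
    "payment_failed",
    extra={"fields": {
        "user_id": "u_4821",
        "order_id": "ord_99234",
        "error_code": "CARD_DECLINED",
        "request_id": "req_9f2a",
        "duration_ms": 1834,
    }},
)
print(stream.getvalue(), end="")
```

Because every line is one JSON object with stable keys, a log pipeline can filter on event or group by error_code without parsing free text.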


2. Use Log Levels Correctly (Stop Abusing ERROR)

Many teams log almost everything as ERROR.
When every line is an error, real failures stop standing out.

Use levels intentionally:

  • INFO → Expected business events
    (user logged in, order created, webhook received)
  • WARN → Recoverable problems
    (timeouts with retries, degraded fallbacks, partial failures)
  • ERROR → Genuine failures needing attention
    (payment failed, data corruption, dependency outage)
  • DEBUG → Development noise
    (disabled in production)

Important:
If something is part of normal control flow, it is not an error.
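
One way to apply these levels is sketched below for a retried call: each timeout that will be retried is a WARN, and ERROR is reserved for the point where retries run out. The call_with_retries helper and the event names are hypothetical.

```python
import logging

logger = logging.getLogger("inventory")


def call_with_retries(operation, attempts: int = 3):
    """Run `operation`; WARN per retried timeout, ERROR only on final failure."""
    for attempt in range(1, attempts + 1):
        try:
            result = operation()
        except TimeoutError:
            if attempt < attempts:
                # Recoverable: another attempt follows, so this is a warning.
                logger.warning("dependency_timeout_retrying", extra={"attempt": attempt})
                continue
            # Retries exhausted: this now needs attention.
            logger.error("dependency_timeout_exhausted", extra={"attempts": attempts})
            raise
        # Expected business event.
        logger.info("dependency_call_succeeded", extra={"attempt": attempt})
        return result
```

With this split, an alert on ERROR fires only when the caller actually failed, not every time a retry quietly succeeded.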


3. Logs Without Context Are Almost Useless

A log message alone rarely answers anything.

Context turns logs into answers.

Always attach context fields:

  • Identifiers: request_id, user_id, order_id, job_id
  • Values: amount, count, size
  • State: status, current_step, phase
  • Errors: error_code, exception_class, stack_trace (sampled)
  • Timing: duration_ms, queued_ms, retry_after_ms

These answer:

  • Who is affected
  • What failed
  • Where in the flow it happened
  • How often it happens
  • How long it took before failing

If your logs cannot answer these, they will not help during incidents.
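
A small sketch of how the context fields above might be attached in practice: identifiers are bound once per request, and each log call adds values, state, error, and timing. The make_logger and authorize_payment names are hypothetical, and the failure is simulated.

```python
import json
import time


def make_logger(**bound):
    """Return a log function that attaches `bound` context to every line."""
    def log(level: str, event: str, **fields) -> dict:
        record = {"level": level, "event": event, **bound, **fields}
        print(json.dumps(record))
        return record
    return log


def authorize_payment(request_id: str, user_id: str, order_id: str, amount_cents: int) -> dict:
    # Identifiers are bound once so every line in this flow carries them.
    log = make_logger(request_id=request_id, user_id=user_id, order_id=order_id)
    started = time.monotonic()
    try:
        # Stand-in for a real gateway call that fails.
        raise TimeoutError("gateway did not respond")
    except TimeoutError as exc:
        return log(
            "error",
            "payment_failed",
            amount_cents=amount_cents,            # value
            current_step="authorize",             # state
            exception_class=type(exc).__name__,   # error
            duration_ms=round((time.monotonic() - started) * 1000),  # timing
        )


record = authorize_payment("req_9f2a", "u_4821", "ord_99234", 4999)
```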


4. Correlation IDs Are Non-Negotiable

Modern systems are concurrent and distributed.
Without correlation IDs, logs become a shuffled deck of unrelated events.

Generate a unique request_id at every entry point:

  • HTTP requests
  • Background jobs
  • Message queue consumers
  • Webhooks

Propagate it everywhere:

  • Across services
  • Across async boundaries
  • Across retries
  • Into all logs

Without this, debugging production issues becomes detective work with missing pages.

If you use distributed tracing (for example, OpenTelemetry), also log:

  • trace_id
  • span_id
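
The propagation described above could look roughly like this inside a single Python service, using contextvars and a logging filter. The start_request helper and the req_ prefix are assumptions, and reading the ID from an inbound header such as X-Request-ID is a common convention rather than a requirement.

```python
import contextvars
import logging
import uuid

# ContextVar values are per-context, so concurrent asyncio tasks
# do not overwrite each other's request ID.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request_id onto every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def start_request(incoming_id=None) -> str:
    """Reuse an upstream ID if one was propagated, otherwise mint a new one."""
    request_id = incoming_id or f"req_{uuid.uuid4().hex[:12]}"
    request_id_var.set(request_id)
    return request_id


handler = logging.StreamHandler()
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("%(levelname)s %(request_id)s %(message)s"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

start_request("req_9f2a")      # e.g. taken from an inbound request header
logger.info("order_created")   # -> INFO req_9f2a order_created
```

Call sites never pass the ID explicitly; the filter stamps it on every line, which makes it hard to forget.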

5. Never Log Sensitive Data (Ever)

This is both a security risk and a compliance nightmare.

Never log:

  • Passwords
  • Tokens / API keys
  • Card numbers
  • Raw SQL with user input
  • Full request or response payloads
  • PII without explicit masking

Use:

  • Masked fields (card_last4)
  • Redaction middleware
  • Allowlists instead of blocklists

Logs often outlive the database records they describe. Treat them as sensitive infrastructure.
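
A minimal sketch of allowlist-based redaction with a masked card field follows. The ALLOWED_FIELDS set and the helper names are illustrative; a real list would come from your own logging contract.

```python
# Only these keys may appear in logs with their real values (illustrative list).
ALLOWED_FIELDS = {
    "level", "event", "request_id", "user_id", "order_id", "error_code", "card_last4",
}


def mask_card(card_number: str) -> str:
    """Keep only the last four digits of a card number."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return digits[-4:]


def redact(fields: dict) -> dict:
    """Allowlist: any key not explicitly approved has its value replaced."""
    return {
        key: (value if key in ALLOWED_FIELDS else "[REDACTED]")
        for key, value in fields.items()
    }


raw = {
    "event": "payment_failed",
    "user_id": "u_4821",
    "password": "hunter2",
    "api_key": "sk_live_example",
    "card_last4": mask_card("4111 1111 1111 1111"),
}
safe = redact(raw)
print(safe)
```

The allowlist matters because a blocklist only protects fields someone remembered to list, while a new, unapproved field is redacted here by default.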


6. Add Service Metadata (Critical in Microservices)

Logs without service metadata become useless in distributed systems.

Include:

  • service name
  • environment (prod/staging/dev)
  • version or commit SHA
  • region / datacenter
  • container or pod id

This allows you to answer:

  • “Did this break after the last deploy?”
  • “Is this only happening in one region?”
  • “Is one pod misbehaving?”
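
One possible way to attach this metadata is to read it once at startup and merge it into every event. The environment variable names below (SERVICE_NAME, APP_ENV, GIT_SHA, REGION) are assumptions for the sketch; use whatever your deploy tooling actually sets.

```python
import os
import socket


def service_metadata() -> dict:
    """Collect static fields once at startup. Env var names are assumptions."""
    return {
        "service": os.environ.get("SERVICE_NAME", "unknown"),
        "env": os.environ.get("APP_ENV", "dev"),
        "version": os.environ.get("GIT_SHA", "unknown"),
        "region": os.environ.get("REGION", "unknown"),
        "host": socket.gethostname(),  # often the pod name inside a container
    }


METADATA = service_metadata()


def enrich(event: dict) -> dict:
    """Merge static metadata into an event; metadata wins on key conflicts."""
    return {**event, **METADATA}
```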

7. Control Noise with Sampling and Deduplication

High-volume logs destroy observability.

If every successful request logs 10 lines, your signal is buried.

Best practices:

  • Sample repeated errors
  • Deduplicate identical stack traces
  • Convert high-frequency success logs into metrics
  • Rate-limit spammy logs

Logs should be diagnostic tools, not spam.
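
Rate-limiting spammy logs can be sketched as a standard logging filter that allows a small burst per message within a time window. The RateLimitFilter class and its parameters are hypothetical, and this version is deliberately simple: it is not thread-safe and does not report how many lines it suppressed.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Allow at most `burst` records per message per `window_s` seconds."""

    def __init__(self, burst: int = 5, window_s: float = 60.0, clock=time.monotonic):
        super().__init__()
        self.burst = burst
        self.window_s = window_s
        self.clock = clock
        self._windows = {}  # key -> (window_start, records_seen)

    def filter(self, record: logging.LogRecord) -> bool:
        # Key on the unformatted message so repeated events share one budget.
        key = (record.name, record.levelno, str(record.msg))
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # window expired: start counting afresh
        self._windows[key] = (start, count + 1)
        return count < self.burst
```

It would typically be attached with handler.addFilter(RateLimitFilter(burst=5, window_s=60)), so a failing dependency produces a handful of lines per minute instead of thousands.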


8. Log Intent, Not Just Outcomes

Most logs say what happened.
Good logs explain why the system chose a path.

Example:

Bad

"Fallback triggered"

Good

{
  "event": "cache_fallback_used",
  "reason": "redis_timeout",
  "timeout_ms": 500,
  "request_id": "req_9f2a"
}

This tells future engineers:

  • What decision was made
  • Why it was made
  • Under what conditions
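
In code, logging intent usually means emitting the event at the point where the decision is made, together with the condition that triggered it. The sketch below assumes hypothetical cache_get and db_get callables and a pluggable log function.

```python
import json


def emit(record: dict) -> None:
    """Default sink: print one JSON line."""
    print(json.dumps(record))


def load_profile(user_id, cache_get, db_get, request_id, timeout_ms=500, log=emit):
    """Read from the cache; on timeout, fall back to the database and say why."""
    try:
        return cache_get(user_id, timeout_ms)
    except TimeoutError:
        # Record the decision and its cause, not just the fact of a fallback.
        log({
            "level": "warn",
            "event": "cache_fallback_used",
            "reason": "redis_timeout",
            "timeout_ms": timeout_ms,
            "request_id": request_id,
        })
        return db_get(user_id)
```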

9. Design Logs for 3 AM Debugging

Ask yourself:

If I were paged right now, would these logs tell me what to do?

Your logs should answer:

  • What broke?
  • Who is affected?
  • Since when?
  • How frequently?
  • Is this new?
  • Did this correlate with a deploy?
  • Is the blast radius growing?

If logs cannot answer these questions, they are not operational logs.


10. Standardize a Logging Contract (Team-Wide)

Without standards, logs decay.

Define a minimal logging contract:

Required fields

  • level
  • event
  • request_id
  • service
  • env
  • version

Recommended fields

  • user_id
  • resource_id
  • duration_ms
  • error_code

Enforce this in:

  • Code reviews
  • Logging wrappers
  • Lint rules
  • Templates
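
A logging wrapper can enforce the contract mechanically. The following is one possible sketch using the required and recommended fields listed above; check_contract and emit are hypothetical names, and raising on violations is probably better suited to tests and CI than to production code paths.

```python
REQUIRED_FIELDS = ("level", "event", "request_id", "service", "env", "version")
RECOMMENDED_FIELDS = ("user_id", "resource_id", "duration_ms", "error_code")


def check_contract(record: dict) -> dict:
    """Report which required and recommended fields are missing or empty."""
    def absent(name: str) -> bool:
        return record.get(name) in (None, "")

    return {
        "missing_required": [n for n in REQUIRED_FIELDS if absent(n)],
        "missing_recommended": [n for n in RECOMMENDED_FIELDS if absent(n)],
    }


def emit(record: dict) -> dict:
    """Logging wrapper: refuse records that break the contract."""
    report = check_contract(record)
    if report["missing_required"]:
        raise ValueError(f"log record missing fields: {report['missing_required']}")
    return record
```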

11. Logs Are Not Metrics (Use Both)

Logs explain why.
Metrics show how bad.

Use:

  • Logs for root cause analysis
  • Metrics for alerting and trends
  • Traces for flow visualization

Trying to use logs as metrics leads to:

  • Cost explosions
  • Slow dashboards
  • Missed alerts
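
To make the split concrete, here is a small sketch in which every request increments a counter but only failures produce a log line. The Counter is a stand-in for a real metrics client, and record_request is a hypothetical helper.

```python
import json
from collections import Counter

request_counts = Counter()  # stand-in for a real metrics client


def record_request(route: str, status: int, request_id: str, log=print) -> None:
    """Count every request, but write a log line only for failures."""
    outcome = "error" if status >= 500 else "ok"
    request_counts[(route, outcome)] += 1  # metric: cheap, aggregated
    if outcome == "error":
        # log: detailed, only where someone will need to investigate
        log(json.dumps({
            "level": "error",
            "event": "request_failed",
            "route": route,
            "status": status,
            "request_id": request_id,
        }))
```

A thousand successful requests then cost a thousand counter increments and zero log lines, while the one failure still carries enough context to investigate.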

12. Design Logs for Humans and Machines

Good logs:

  • Are machine-queryable
  • But still human-readable
  • Use stable field names
  • Avoid cryptic abbreviations
  • Avoid emotional messages

Your future teammate may be reading your logs at 4 AM.
Be kind to that person.


The Real Goal of Logging

The goal is not to “have logs.”

The goal is:

Any engineer can open the logs during an incident and understand what happened without opening the source code.

If your logs achieve that, you have done logging correctly.


Finally

Good logging is not accidental.
It is a design discipline.

If your team treats logs as first-class system design, production incidents stop being chaotic and start becoming diagnosable engineering problems.

That is the difference between:

  • “Production is down, panic”
    and
  • “We know exactly what broke, why, and where to fix it.”
