Designing Logs That Actually Help When Production Is on Fire
Most teams do not design logs.
They react to incidents.
Someone adds a console.log, fixes the bug, and moves on. Over time, production logs become a noisy graveyard of half-sentences, inconsistent formats, missing context, and emotionally charged messages like:
“wtf is this even happening”
Then one day, production goes down at 2:47 AM, and suddenly those logs are all you have.
This article is about designing logs intentionally, so that any engineer—on-call or not—can understand what happened without opening the source code.
Logs Are Not Debug Statements
This is the first mindset shift.
Logs are operational tools, not developer notes.
They exist to help engineers:
- Debug incidents
- Investigate user issues
- Trace failures across systems
- Measure system health
If a log message only makes sense to the person who wrote the code, it is not a useful log.
1. Use Structured Logging (Not Free-Text Messages)
Plain text logs are human-readable but machine-useless.
Structured logs (usually JSON) are queryable, filterable, and aggregatable.
Bad
"Payment failed for user 4821"
Good
{
"level": "error",
"event": "payment_failed",
"user_id": "u_4821",
"order_id": "ord_99234",
"error_code": "CARD_DECLINED",
"request_id": "req_9f2a",
"duration_ms": 1834
}
Now your observability stack can:
- Alert on event = payment_failed
- Count failure rates by error_code
- Track latency trends using duration_ms
- Correlate errors by request_id
Design your logs like database rows, not diary entries.
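As a sketch of what this looks like in practice, here is a minimal JSON-lines formatter built on Python's standard logging module. The JsonFormatter class and the convention of passing structured fields via extra={"fields": ...} are assumptions for illustration, not a standard API; real projects often use a library such as structlog instead.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Merge structured fields passed via the `extra=` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "error", "event": "payment_failed", "user_id": ...}
logger.error("payment_failed", extra={"fields": {
    "user_id": "u_4821",
    "order_id": "ord_99234",
    "error_code": "CARD_DECLINED",
    "request_id": "req_9f2a",
    "duration_ms": 1834,
}})
```

Because every line is one JSON object, your log pipeline can index each field without regex parsing.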
2. Use Log Levels Correctly (Stop Abusing ERROR)
Most teams log everything as ERROR.
This destroys signal quality.
Use levels intentionally:
- INFO → Expected business events (user logged in, order created, webhook received)
- WARN → Recoverable problems (timeouts with retries, degraded fallbacks, partial failures)
- ERROR → Genuine failures needing attention (payment failed, data corruption, dependency outage)
- DEBUG → Development noise (disabled in production)
Important:
If something is part of normal control flow, it is not an error.
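A retry loop is the classic place this rule bites. A sketch, in Python, of level choice inside one (the charge_with_retry helper and its events are hypothetical):

```python
import logging

logger = logging.getLogger("payments")

def charge_with_retry(charge, attempts=3):
    """Retries are normal control flow: WARN while retrying, ERROR only when out of retries."""
    for attempt in range(1, attempts + 1):
        try:
            result = charge()
            # Expected business event: INFO, never ERROR.
            logger.info("charge_succeeded attempt=%d", attempt)
            return result
        except TimeoutError:
            if attempt < attempts:
                # Recoverable problem with a retry still in flight: WARN.
                logger.warning("charge_timed_out, retrying attempt=%d", attempt)
            else:
                # Genuine failure needing attention: ERROR.
                logger.error("charge_failed attempts_exhausted=%d", attempts)
                raise
```

With this split, an alert on ERROR fires only when the system has actually given up, not on every transient timeout.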
3. Logs Without Context Are Almost Useless
A log message alone rarely answers anything.
Context turns logs into answers.
Always attach context fields:
- Identifiers → request_id, user_id, order_id, job_id
- Values → amount, count, size
- State → status, current_step, phase
- Errors → error_code, exception_class, stack_trace (sampled)
- Timing → duration_ms, queued_ms, retry_after_ms
These answer:
- Who is affected
- What failed
- Where in the flow it happened
- How often it happens
- How long it took before failing
If your logs cannot answer these, they will not help during incidents.
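One way to make context cheap to attach is a tiny helper that takes the event name plus keyword fields, so every call site uses the same field names. This log_event helper is hypothetical, shown only to illustrate the field categories above:

```python
import json
import time

def log_event(event, **context):
    """Hypothetical helper: one JSON line per event, context as named fields."""
    line = {"event": event, **context}
    print(json.dumps(line))
    return line  # returned so callers (and tests) can inspect it

start = time.monotonic()
# ... handler work would happen here ...
log_event(
    "export_failed",
    request_id="req_9f2a",      # identifier: correlates with other lines
    user_id="u_4821",           # identifier: who is affected
    status="retrying",          # state: where in the flow we are
    error_code="S3_TIMEOUT",    # error: what failed, as a stable code
    duration_ms=int((time.monotonic() - start) * 1000),  # timing
)
```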
4. Correlation IDs Are Non-Negotiable
Modern systems are concurrent and distributed.
Without correlation IDs, logs become a shuffled deck of unrelated events.
Generate a unique request_id at every entry point:
- HTTP requests
- Background jobs
- Message queue consumers
- Webhooks
Propagate it everywhere:
- Across services
- Across async boundaries
- Across retries
- Into all logs
Without this, debugging production issues becomes detective work with missing pages.
If you use distributed tracing (OpenTelemetry), also log:
- trace_id
- span_id
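In Python, contextvars is one way to propagate a correlation ID without threading it through every function signature; it survives async boundaries as well. The log helper and event names below are assumptions for illustration:

```python
import contextvars
import json
import uuid

# One context variable carries the correlation id through the call stack.
request_id_var = contextvars.ContextVar("request_id", default=None)

def log(event, **fields):
    """Hypothetical logger that stamps the current request_id on every line."""
    line = {"event": event, "request_id": request_id_var.get(), **fields}
    print(json.dumps(line))
    return line

def handle_request():
    # Entry point: generate the id once, then never pass it by hand again.
    request_id_var.set(f"req_{uuid.uuid4().hex[:8]}")
    log("request_received")
    charge_card()

def charge_card():
    # Deep in the call stack: same request_id, no parameter plumbing.
    log("payment_attempted", amount=1999)

handle_request()
```

Both emitted lines share one request_id, so a single filter in your log tool reconstructs the whole request.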
5. Never Log Sensitive Data (Ever)
This is both a security risk and a compliance nightmare.
Never log:
- Passwords
- Tokens / API keys
- Card numbers
- Raw SQL with user input
- Full request or response payloads
- PII without explicit masking
Use:
- Masked fields (card_last4)
- Redaction middleware
- Allowlists instead of blocklists
Logs are often retained longer than databases. Treat them as sensitive infrastructure.
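The allowlist approach can be sketched in a few lines: anything not explicitly permitted is dropped before it reaches the log pipeline. The field names here are illustrative, not a standard:

```python
# Allowlist: only fields we have deliberately approved may be logged.
ALLOWED_FIELDS = {"event", "request_id", "user_id", "error_code", "card_last4"}

def sanitize(fields):
    """Drop every field not on the allowlist; unknown data is never trusted."""
    return {k: v for k, v in fields.items() if k in ALLOWED_FIELDS}

raw = {
    "event": "payment_failed",
    "request_id": "req_9f2a",
    "card_number": "4242424242424242",  # must never reach the logs
    "card_last4": "4242",               # masked form is fine
    "password": "hunter2",              # must never reach the logs
}
safe = sanitize(raw)
```

A blocklist would have to anticipate every dangerous field name; an allowlist fails safe when a new field appears.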
6. Add Service Metadata (Critical in Microservices)
Logs without service metadata become useless in distributed systems.
Include:
- service name
- environment (prod/staging/dev)
- version or commit SHA
- region / datacenter
- container or pod id
This allows you to answer:
- “Did this break after the last deploy?”
- “Is this only happening in one region?”
- “Is one pod misbehaving?”
7. Control Noise with Sampling and Deduplication
High-volume logs destroy observability.
If every successful request logs 10 lines, your signal is buried.
Best practices:
- Sample repeated errors
- Deduplicate identical stack traces
- Convert high-frequency success logs into metrics
- Rate-limit spammy logs
Logs should be diagnostic tools, not spam.
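Rate-limiting per event is one simple noise control: after N identical lines in a window, stay quiet until the window resets. This RateLimitedLogger class is a hypothetical sketch; production systems usually get this from their logging library or pipeline:

```python
import time
from collections import defaultdict

class RateLimitedLogger:
    """Hypothetical wrapper: emit at most `limit` lines per event per time window."""
    def __init__(self, limit=5, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)
        self.window_start = defaultdict(float)

    def should_log(self, event, now=None):
        now = time.monotonic() if now is None else now
        # Start a fresh window when the old one has elapsed.
        if now - self.window_start[event] >= self.window_s:
            self.window_start[event] = now
            self.counts[event] = 0
        self.counts[event] += 1
        return self.counts[event] <= self.limit

limiter = RateLimitedLogger(limit=3, window_s=60.0)
# Ten identical errors inside one window: only the first three are emitted.
emitted = sum(limiter.should_log("redis_timeout", now=100.0) for _ in range(10))
```

A refinement worth considering: when suppression ends, log a summary line with the suppressed count, so the volume itself is not lost.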
8. Log Intent, Not Just Outcomes
Most logs say what happened.
Good logs explain why the system chose a path.
Example:
Bad
"Fallback triggered"
Good
{
"event": "cache_fallback_used",
"reason": "redis_timeout",
"timeout_ms": 500,
"request_id": "req_9f2a"
}
This tells future engineers:
- What decision was made
- Why it was made
- Under what conditions
9. Design Logs for 3 AM Debugging
Ask yourself:
If I were paged right now, would these logs tell me what to do?
Your logs should answer:
- What broke?
- Who is affected?
- Since when?
- How frequently?
- Is this new?
- Did this correlate with a deploy?
- Is the blast radius growing?
If logs cannot answer these questions, they are not operational logs.
10. Standardize a Logging Contract (Team-Wide)
Without standards, logs decay.
Define a minimal logging contract:
Required fields
- level
- event
- request_id
- service
- env
- version
Recommended fields
- user_id
- resource_id
- duration_ms
- error_code
Enforce this in:
- Code reviews
- Logging wrappers
- Lint rules
- Templates
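A logging wrapper can enforce the contract mechanically, so a missing field fails in development rather than silently in production. A minimal sketch (validate_log is hypothetical):

```python
# The team-wide contract: every log line must carry these fields.
REQUIRED_FIELDS = {"level", "event", "request_id", "service", "env", "version"}

def validate_log(fields):
    """Reject any log line missing a required contract field."""
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"log line missing required fields: {sorted(missing)}")
    return fields

validate_log({
    "level": "error",
    "event": "payment_failed",
    "request_id": "req_9f2a",
    "service": "billing",
    "env": "prod",
    "version": "a1b2c3d",
})
```

In production you would likely downgrade the hard failure to a WARN plus a metric, but the check itself stays the same.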
11. Logs Are Not Metrics (Use Both)
Logs explain why.
Metrics show how bad.
Use:
- Logs for root cause analysis
- Metrics for alerting and trends
- Traces for flow visualization
Trying to use logs as metrics leads to:
- Cost explosions
- Slow dashboards
- Missed alerts
12. Design Logs for Humans and Machines
Good logs:
- Are machine-queryable
- But still human-readable
- Use stable field names
- Avoid cryptic abbreviations
- Avoid emotional messages
Your future teammate may be reading your logs at 4 AM.
Be kind to that person.
The Real Goal of Logging
The goal is not to “have logs.”
The goal is:
Any engineer can open the logs during an incident and understand what happened without opening the source code.
If your logs achieve that, you have done logging correctly.
Finally
Good logging is not accidental.
It is a design discipline.
If your team treats logs as first-class system design, production incidents stop being chaotic and start becoming diagnosable engineering problems.
That is the difference between:
- “Production is down, panic”
- “We know exactly what broke, why, and where to fix it.”