Designing Logs That Actually Help When Production Is on Fire
Most teams do not design logs.
They react to incidents.
Someone adds a console.log, fixes the bug, and moves on. Over time, production logs become a noisy graveyard of half-sentences, inconsistent formats, missing context, and emotionally charged messages like:
“wtf is this even happening”
Then one day, production goes down at 2:47 AM, and suddenly those logs are all you have.
This article is about designing logs intentionally, so that any engineer—on-call or not—can understand what happened without opening the source code.
Logs Are Not Debug Statements
This is the first mindset shift.
Logs are operational tools, not developer notes.
They exist to help engineers:
- Debug incidents
- Investigate user issues
- Trace failures across systems
- Measure system health
If a log message only makes sense to the person who wrote the code, it is not a useful log.
1. Use Structured Logging (Not Free-Text Messages)
Plain text logs are human-readable but machine-useless.
Structured logs (usually JSON) are queryable, filterable, and aggregatable.
Bad
"Payment failed for user 4821"
Good
{
"level": "error",
"event": "payment_failed",
"user_id": "u_4821",
"order_id": "ord_99234",
"error_code": "CARD_DECLINED",
"request_id": "req_9f2a",
"duration_ms": 1834
}
Now your observability stack can:
- Alert on event = payment_failed
- Count failure rates by error_code
- Track latency trends using duration_ms
- Correlate errors by request_id
Design your logs like database rows, not diary entries.
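As a sketch of what this looks like in practice, here is a minimal JSON-lines formatter built on Python's standard logging module. The JsonFormatter class and the convention of passing structured fields via extra={"fields": ...} are assumptions for illustration, not a standard API; real projects often use a library such as structlog instead.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Merge structured fields passed via the `extra=` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits: {"level": "error", "event": "payment_failed", "user_id": ...}
logger.error("payment_failed", extra={"fields": {
    "user_id": "u_4821",
    "order_id": "ord_99234",
    "error_code": "CARD_DECLINED",
    "request_id": "req_9f2a",
    "duration_ms": 1834,
}})
```

Because every line is one JSON object, your log pipeline can index each field without regex parsing.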
2. Use Log Levels Correctly (Stop Abusing ERROR)
Most teams log everything as ERROR.
This destroys signal quality.
Use levels intentionally:
- INFO → Expected business events (user logged in, order created, webhook received)
- WARN → Recoverable problems (timeouts with retries, degraded fallbacks, partial failures)
- ERROR → Genuine failures needing attention (payment failed, data corruption, dependency outage)
- DEBUG → Development noise (disabled in production)
Important:
If something is part of normal control flow, it is not an error.
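A retry loop is the classic place this rule bites. A sketch, in Python, of level choice inside one (the charge_with_retry helper and its events are hypothetical):

```python
import logging

logger = logging.getLogger("payments")

def charge_with_retry(charge, attempts=3):
    """Retries are normal control flow: WARN while retrying, ERROR only when out of retries."""
    for attempt in range(1, attempts + 1):
        try:
            result = charge()
            # Expected business event: INFO, never ERROR.
            logger.info("charge_succeeded attempt=%d", attempt)
            return result
        except TimeoutError:
            if attempt < attempts:
                # Recoverable problem with a retry still in flight: WARN.
                logger.warning("charge_timed_out, retrying attempt=%d", attempt)
            else:
                # Genuine failure needing attention: ERROR.
                logger.error("charge_failed attempts_exhausted=%d", attempts)
                raise
```

With this split, an alert on ERROR fires only when the system has actually given up, not on every transient timeout.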
3. Logs Without Context Are Almost Useless
A log message alone rarely answers anything.
Context turns logs into answers.
Always attach context fields:
- Identifiers → request_id, user_id, order_id, job_id
- Values → amount, count, size
- State → status, current_step, phase
- Errors → error_code, exception_class, stack_trace (sampled)
- Timing → duration_ms, queued_ms, retry_after_ms
These answer:
- Who is affected
- What failed
- Where in the flow it happened
- How often it happens
- How long it took before failing
If your logs cannot answer these, they will not help during incidents.
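One way to make context cheap to attach is a tiny helper that takes the event name plus keyword fields, so every call site uses the same field names. This log_event helper is hypothetical, shown only to illustrate the field categories above:

```python
import json
import time

def log_event(event, **context):
    """Hypothetical helper: one JSON line per event, context as named fields."""
    line = {"event": event, **context}
    print(json.dumps(line))
    return line  # returned so callers (and tests) can inspect it

start = time.monotonic()
# ... handler work would happen here ...
log_event(
    "export_failed",
    request_id="req_9f2a",      # identifier: correlates with other lines
    user_id="u_4821",           # identifier: who is affected
    status="retrying",          # state: where in the flow we are
    error_code="S3_TIMEOUT",    # error: what failed, as a stable code
    duration_ms=int((time.monotonic() - start) * 1000),  # timing
)
```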
4. Correlation IDs Are Non-Negotiable
Modern systems are concurrent and distributed.
Without correlation IDs, logs become a shuffled deck of unrelated events.
Generate a unique request_id at every entry point:
- HTTP requests
- Background jobs
- Message queue consumers
- Webhooks
Propagate it everywhere:
- Across services
- Across async boundaries
- Across retries
- Into all logs
Without this, debugging production issues becomes detective work with missing pages.
If you use distributed tracing (OpenTelemetry), also log:
- trace_id
- span_id
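In Python, contextvars is one way to propagate a correlation ID without threading it through every function signature; it survives async boundaries as well. The log helper and event names below are assumptions for illustration:

```python
import contextvars
import json
import uuid

# One context variable carries the correlation id through the call stack.
request_id_var = contextvars.ContextVar("request_id", default=None)

def log(event, **fields):
    """Hypothetical logger that stamps the current request_id on every line."""
    line = {"event": event, "request_id": request_id_var.get(), **fields}
    print(json.dumps(line))
    return line

def handle_request():
    # Entry point: generate the id once, then never pass it by hand again.
    request_id_var.set(f"req_{uuid.uuid4().hex[:8]}")
    log("request_received")
    charge_card()

def charge_card():
    # Deep in the call stack: same request_id, no parameter plumbing.
    log("payment_attempted", amount=1999)

handle_request()
```

Both emitted lines share one request_id, so a single filter in your log tool reconstructs the whole request.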
5. Never Log Sensitive Data (Ever)
This is both a security risk and a compliance nightmare.
Never log:
- Passwords
- Tokens / API keys
- Card numbers
- Raw SQL with user input
- Full request or response payloads
- PII without explicit masking
Use:
- Masked fields (card_last4)
- Redaction middleware
- Allowlists instead of blocklists
Logs are often retained longer than databases. Treat them as sensitive infrastructure.
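The allowlist approach can be sketched in a few lines: anything not explicitly permitted is dropped before it reaches the log pipeline. The field names here are illustrative, not a standard:

```python
# Allowlist: only fields we have deliberately approved may be logged.
ALLOWED_FIELDS = {"event", "request_id", "user_id", "error_code", "card_last4"}

def sanitize(fields):
    """Drop every field not on the allowlist; unknown data is never trusted."""
    return {k: v for k, v in fields.items() if k in ALLOWED_FIELDS}

raw = {
    "event": "payment_failed",
    "request_id": "req_9f2a",
    "card_number": "4242424242424242",  # must never reach the logs
    "card_last4": "4242",               # masked form is fine
    "password": "hunter2",              # must never reach the logs
}
safe = sanitize(raw)
```

A blocklist would have to anticipate every dangerous field name; an allowlist fails safe when a new field appears.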
6. Add Service Metadata (Critical in Microservices)
Logs without service metadata become useless in distributed systems.
Include:
- service name
- environment (prod/staging/dev)
- version or commit SHA
- region / datacenter
- container or pod id
This allows you to answer:
- “Did this break after the last deploy?”
- “Is this only happening in one region?”
- “Is one pod misbehaving?”
7. Control Noise with Sampling and Deduplication
High-volume logs destroy observability.
If every successful request logs 10 lines, your signal is buried.
Best practices:
- Sample repeated errors
- Deduplicate identical stack traces
- Convert high-frequency success logs into metrics
- Rate-limit spammy logs
Logs should be diagnostic tools, not spam.
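Rate-limiting per event is one simple noise control: after N identical lines in a window, stay quiet until the window resets. This RateLimitedLogger class is a hypothetical sketch; production systems usually get this from their logging library or pipeline:

```python
import time
from collections import defaultdict

class RateLimitedLogger:
    """Hypothetical wrapper: emit at most `limit` lines per event per time window."""
    def __init__(self, limit=5, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)
        self.window_start = defaultdict(float)

    def should_log(self, event, now=None):
        now = time.monotonic() if now is None else now
        # Start a fresh window when the old one has elapsed.
        if now - self.window_start[event] >= self.window_s:
            self.window_start[event] = now
            self.counts[event] = 0
        self.counts[event] += 1
        return self.counts[event] <= self.limit

limiter = RateLimitedLogger(limit=3, window_s=60.0)
# Ten identical errors inside one window: only the first three are emitted.
emitted = sum(limiter.should_log("redis_timeout", now=100.0) for _ in range(10))
```

A refinement worth considering: when suppression ends, log a summary line with the suppressed count, so the volume itself is not lost.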
8. Log Intent, Not Just Outcomes
Most logs say what happened.
Good logs explain why the system chose a path.
Example:
Bad
"Fallback triggered"
Good
{
"event": "cache_fallback_used",
"reason": "redis_timeout",
"timeout_ms": 500,
"request_id": "req_9f2a"
}
This tells future engineers:
- What decision was made
- Why it was made
- Under what conditions
9. Design Logs for 3 AM Debugging
Ask yourself:
If I were paged right now, would these logs tell me what to do?
Your logs should answer:
- What broke?
- Who is affected?
- Since when?
- How frequently?
- Is this new?
- Did this correlate with a deploy?
- Is the blast radius growing?
If logs cannot answer these questions, they are not operational logs.
10. Standardize a Logging Contract (Team-Wide)
Without standards, logs decay.
Define a minimal logging contract:
Required fields
- level
- event
- request_id
- service
- env
- version
Recommended fields
- user_id
- resource_id
- duration_ms
- error_code
Enforce this in:
- Code reviews
- Logging wrappers
- Lint rules
- Templates
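A logging wrapper can enforce the contract mechanically, so a missing field fails in development rather than silently in production. A minimal sketch (validate_log is hypothetical):

```python
# The team-wide contract: every log line must carry these fields.
REQUIRED_FIELDS = {"level", "event", "request_id", "service", "env", "version"}

def validate_log(fields):
    """Reject any log line missing a required contract field."""
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"log line missing required fields: {sorted(missing)}")
    return fields

validate_log({
    "level": "error",
    "event": "payment_failed",
    "request_id": "req_9f2a",
    "service": "billing",
    "env": "prod",
    "version": "a1b2c3d",
})
```

In production you would likely downgrade the hard failure to a WARN plus a metric, but the check itself stays the same.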
11. Logs Are Not Metrics (Use Both)
Logs explain why.
Metrics show how bad.
Use:
- Logs for root cause analysis
- Metrics for alerting and trends
- Traces for flow visualization
Trying to use logs as metrics leads to:
- Cost explosions
- Slow dashboards
- Missed alerts
12. Design Logs for Humans and Machines
Good logs:
- Are machine-queryable
- But still human-readable
- Use stable field names
- Avoid cryptic abbreviations
- Avoid emotional messages
Your future teammate may be reading your logs at 4 AM.
Be kind to that person.
The Real Goal of Logging
The goal is not to “have logs.”
The goal is:
Any engineer can open the logs during an incident and understand what happened without opening the source code.
If your logs achieve that, you have done logging correctly.
Finally
Good logging is not accidental.
It is a design discipline.
If your team treats logs as first-class system design, production incidents stop being chaotic and start becoming diagnosable engineering problems.
That is the difference between:
- “Production is down, panic”
- “We know exactly what broke, why, and where to fix it.”