Observability: Logs, Metrics, and Traces
When a production system breaks at 3am, you need to answer three questions fast: what happened, how bad is it, and where is the bottleneck? Logs tell you what happened — discrete events with timestamps and context. Metrics tell you how bad — aggregated numbers like error rate, latency percentiles, and CPU usage. Traces tell you where — the full journey of a single request across every service and database call.
These three pillars form observability. Monitoring tells you when something is wrong; observability lets you understand why. Modern stacks unify all three with OpenTelemetry: one set of SDKs that exports logs, metrics, and traces to backends such as Prometheus, Jaeger, Datadog, or AWS X-Ray, with Grafana as a common visualization layer.
Logs
Logs are timestamped records of discrete events — errors, warnings, info messages, and debug output. A good log line includes a timestamp, severity level, service name, trace ID, and structured context (user ID, order ID, request path). Unstructured logs ("payment failed") are hard to search; structured JSON logs ({"level":"error","order_id":991,"reason":"timeout"}) are queryable at scale.
Centralize logs with ELK (Elasticsearch, Logstash, Kibana), Grafana Loki, or CloudWatch Logs. Never log secrets — passwords, tokens, credit card numbers. Use log levels consistently: ERROR for failures requiring action, WARN for recoverable issues, INFO for business events, DEBUG for development only.
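The following is a minimal sketch of structured JSON logging using only Python's standard library. The service name, the list of context fields, and the trace ID value are illustrative assumptions, not part of any particular logging stack.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields, copied from the record when present.
    CONTEXT_FIELDS = ("trace_id", "user_id", "order_id", "request_path", "reason")

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("checkout")
logger.error(
    "payment failed",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "order_id": 991, "reason": "timeout"},
)
```

Because context travels as separate fields instead of being concatenated into the message, a log backend can filter on `order_id` or `trace_id` directly.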
Quick reference
- Best for: debugging specific failures, audit trails, understanding sequence of events.
- Strengths: rich context, human-readable, captures unexpected edge cases.
- Weaknesses: expensive to store and index at scale (high volume, high-cardinality fields), poor for aggregates or trends.
- Use structured JSON logging — not string concatenation.
- Include trace_id in every log line to correlate with distributed traces.
- Set retention policies, for example 30 days hot, 90 days cold, then archive for compliance.
Remember this
Logs answer "what happened?" — use structured JSON with trace IDs for searchable, correlatable events.
Metrics
Metrics are numeric measurements aggregated over time — counters (total requests), gauges (current memory usage), and histograms (latency distribution). Unlike logs, metrics are cheap to store and query at scale. A Prometheus server scraping /metrics every 15 seconds can track millions of time series across hundreds of services.
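To make the three metric types concrete, here is a toy in-memory sketch. It is not the API of Prometheus or any client library; the class names and bucket bounds are assumptions chosen for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Counter:
    """Monotonically increasing total, e.g. requests served."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can rise or fall, e.g. memory in use."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; bounds are inclusive upper limits."""
    bounds: Tuple[float, ...]
    counts: List[int] = field(init=False)
    total: float = 0.0

    def __post_init__(self) -> None:
        # One extra slot for values above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> List[int]:
        """Running totals per bucket, the form Prometheus exposes."""
        running, out = 0, []
        for count in self.counts:
            running += count
            out.append(running)
        return out


latency = Histogram(bounds=(0.1, 0.2, 0.5))  # seconds, hypothetical buckets
for seconds in (0.05, 0.1, 0.15, 0.6):
    latency.observe(seconds)
print(latency.counts, latency.cumulative())
```

Storage cost depends on the number of buckets and label combinations, not on the number of requests, which is why metrics stay cheap where logs do not.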
Define SLIs (Service Level Indicators) from metrics: availability (uptime %), latency (p99 response time), and error rate (5xx / total). Set SLOs (Service Level Objectives) — "99.9% of requests complete under 200ms" — and alert when error budgets burn too fast. RED method (Rate, Errors, Duration) covers most HTTP services; USE method (Utilization, Saturation, Errors) covers infrastructure.
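The error-budget idea can be worked through with a little arithmetic. This sketch assumes a ratio-based SLO and a 30-day window; the function names and the request counts are hypothetical.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed failure ratio divided by the allowed failure ratio.

    1.0 means the budget is being spent exactly as fast as the SLO allows.
    """
    if total == 0:
        return 0.0
    return (bad / total) / error_budget(slo_target)


def hours_until_exhausted(rate: float, window_hours: float = 30 * 24) -> float:
    """Time to spend the whole budget if the current burn rate holds."""
    return float("inf") if rate <= 0 else window_hours / rate


# Hypothetical hour of traffic: 5,000 bad requests out of 1,000,000.
rate = burn_rate(bad=5_000, total=1_000_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x, budget gone in {hours_until_exhausted(rate):.0f} h")
```

With these numbers the failure ratio is 0.5% against a 0.1% budget, so the budget burns roughly five times too fast and a 30-day allowance would last about six days.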
Quick reference
- Best for: dashboards, alerting, capacity planning, SLO tracking.
- Strengths: cheap storage, fast queries, trend detection, proactive alerting.
- Weaknesses: no per-request detail, loses context, requires upfront instrumentation.
- RED for services: Request rate, Error rate, Duration (latency).
- USE for resources: Utilization, Saturation, Errors.
- Alert on symptoms (high error rate), not causes (high CPU): high CPU can be normal under load and may not affect users at all.
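As an illustration of the RED method, the sketch below derives rate, error ratio, and p99 duration from raw request records. In practice a metrics backend computes these from counters and histograms; the record shape and the nearest-rank percentile are simplifying assumptions.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time to serve the request


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, Duration for one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p99_ms": percentile([r.duration_ms for r in requests], 99) if total else 0.0,
    }


# Hypothetical 10-second window: 98 fast successes and 2 slow server errors.
window = [Request(200, 50.0)] * 98 + [Request(500, 900.0)] * 2
print(red_summary(window, window_seconds=10))
```

An alert on `error_ratio` or `p99_ms` fires on what users experience, regardless of which resource caused it.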
Remember this
Metrics answer "how much and how fast?" — instrument RED metrics on every service and alert on SLO breaches.
Distributed Traces
A distributed trace follows one request from entry to exit across all services. Each service creates spans — timed operations with metadata. A trace ID links spans across services; a parent span ID shows nesting. When checkout takes 2 seconds, the trace reveals that 1.8 seconds were spent in the payment service calling an external API.
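The checkout example can be sketched as data. The span names, IDs, and timings below are invented to mirror the scenario, and the self-time calculation assumes the children of a span do not overlap in time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in the request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Span duration minus the time spent in its direct children."""
    child_total: Dict[str, int] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def bottleneck(spans: List[Span]) -> Span:
    """The span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"  # hypothetical
spans = [
    Span(TRACE_ID, "a", None, "checkout", "POST /checkout", 0, 2000),
    Span(TRACE_ID, "b", "a", "inventory", "reserve_stock", 20, 120),
    Span(TRACE_ID, "c", "a", "payment", "charge_card", 130, 1930),
    Span(TRACE_ID, "d", "c", "payment", "external gateway call", 150, 1900),
]
slow = bottleneck(spans)
print(slow.service, slow.name, slow.duration_ms)
```

Sorting by self time rather than total duration keeps the root span, which always looks slowest, from hiding the call that actually consumed the time.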
OpenTelemetry is the vendor-neutral standard for generating traces, metrics, and logs. Instrument your code once with OTel SDKs, export to any backend (Jaeger, Tempo, X-Ray, Datadog). In microservices, tracing is essential — without it, debugging cross-service latency is guesswork.
Quick reference
- Best for: latency debugging, cross-service dependency mapping, bottleneck identification.
- Strengths: end-to-end request visibility, shows exact slow span, maps service dependencies.
- Weaknesses: sampling required at scale (trace 1-10% of requests), storage cost, instrumentation effort.
- Use OpenTelemetry — one SDK for traces, metrics, and logs.
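- Sample intelligently: keep every trace that contains an error and 1-5% of successful requests. Keeping all errors requires tail-based sampling, because the outcome is not known when the request starts.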
- Propagate trace context via HTTP headers (traceparent) and message queue metadata.
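Context propagation through the traceparent header can be sketched as follows. The header layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) follows the W3C Trace Context format; the function and type names are assumptions, and the parser only handles the four-field version 00 form. OpenTelemetry SDKs do this for you, so this is for understanding rather than production use.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id - parent-id - flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the incoming context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
if ctx is not None:
    print(child_traceparent(ctx))
```

The trace ID stays constant across hops while each service substitutes its own span ID, which is what lets a backend stitch spans from different services into one trace.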
Remember this
Traces answer "where is it slow?" — use OpenTelemetry and always trace errors, sample the rest.
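A sampling rule of this kind might look like the sketch below. It assumes the decision is made after the request finishes (tail-based), since that is the only point at which errors are known; the function name and the 5% default are illustrative.

```python
def keep_trace(trace_id: str, had_error: bool, sample_ratio: float = 0.05) -> bool:
    """Decide whether to store a finished trace.

    Errors are always kept. Successful traces are kept when the low 64 bits
    of the trace ID fall below the ratio threshold, so every component that
    sees the same trace ID reaches the same decision.
    """
    if not 0.0 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0 and 1")
    if had_error:
        return True
    threshold = int(sample_ratio * 2**64)
    return int(trace_id[-16:], 16) < threshold


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # hypothetical
print(keep_trace(trace_id, had_error=True))
print(keep_trace(trace_id, had_error=False))
```

Deriving the decision from the trace ID instead of a random number avoids partial traces where one service kept its spans and another dropped them.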
OpenTelemetry in Production
OpenTelemetry (OTel) provides SDKs for most major languages, auto-instrumentation for popular frameworks, and a collector that receives, processes, and exports telemetry to any supported backend. A typical pipeline: instrument app → OTel SDK → OTel Collector → Prometheus (metrics) + Jaeger (traces) + Loki (logs) → Grafana dashboards.
Start with auto-instrumentation for HTTP, database, and message queue calls: it delivers most of the value with little or no code change. Add custom spans for business-critical operations (payment processing, order fulfillment). Correlate all three pillars: every log line includes trace_id, and every metric dashboard links to traces for anomalous periods.
Quick reference
- Pipeline: App → OTel SDK → Collector → backends (Prometheus, Jaeger, Loki).
- Auto-instrumentation: HTTP, gRPC, database drivers, Redis, Kafka — zero code for basics.
- Custom spans: wrap business operations with tracer.startSpan() for domain visibility.
- Grafana unifies metrics (Prometheus), traces (Tempo), and logs (Loki) in one UI.
- Define dashboards per service: RED metrics + error trace links + recent error logs.
- Run the OTel Collector as a sidecar or daemonset — not embedded in every app process.
Remember this
Standardize on OpenTelemetry — auto-instrument first, add custom spans for business logic, unify in Grafana.
Observability is not optional for production systems. Logs catch the details, metrics catch the trends, and traces catch the latency. Together they can cut mean time to resolution from hours to minutes. Start with structured logging and RED metrics on every service, add OpenTelemetry tracing once you have more than two services, and build dashboards before you need them, not during an outage.