
Observability: Logs, Metrics, and Traces

July 4, 2026 · 11 min read

When a production system breaks at 3am, you need to answer three questions fast: what happened, how bad is it, and where is the bottleneck? Logs tell you what happened — discrete events with timestamps and context. Metrics tell you how bad — aggregated numbers like error rate, latency percentiles, and CPU usage. Traces tell you where — the full journey of a single request across every service and database call.

These three pillars form observability. Monitoring tells you when something is wrong; observability lets you understand why. Modern stacks unify all three with OpenTelemetry, a vendor-neutral set of APIs and SDKs that exports logs, metrics, and traces to backends such as Prometheus, Grafana, Jaeger, Datadog, or AWS X-Ray.

Three pillars of observability

Logs

Logs are timestamped records of discrete events — errors, warnings, info messages, and debug output. A good log line includes a timestamp, severity level, service name, trace ID, and structured context (user ID, order ID, request path). Unstructured logs ("payment failed") are hard to search; structured JSON logs ({"level":"error","order_id":991,"reason":"timeout"}) are queryable at scale.

Centralize logs with ELK (Elasticsearch, Logstash, Kibana), Grafana Loki, or CloudWatch Logs. Never log secrets — passwords, tokens, credit card numbers. Use log levels consistently: ERROR for failures requiring action, WARN for recoverable issues, INFO for business events, DEBUG for development only.
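As a rough illustration of the structured-logging advice above, the sketch below uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, the `context` and `trace_id` extras, and the `SECRET_KEYS` list are assumptions made for this example, not the API of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone

# Illustrative, non-exhaustive list of keys that must never reach the log output.
SECRET_KEYS = {"password", "token", "card_number"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        # Merge the structured context, masking values stored under secret keys.
        for key, value in getattr(record, "context", {}).items():
            entry[key] = "[REDACTED]" if key in SECRET_KEYS else value
        return json.dumps(entry)


def build_logger(service: str, stream) -> logging.Logger:
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    import sys

    log = build_logger("checkout", sys.stdout)
    log.error(
        "payment failed",
        extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
               "context": {"order_id": 991, "reason": "timeout"}},
    )
```

Passing context through `extra` keeps the message string constant, which makes it easier to group identical events in a log backend.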

Logs: timestamped events with context

Quick reference

Best for: debugging specific failures, audit trails, understanding sequence of events.
Strengths: rich context, human-readable, captures unexpected edge cases.
Weaknesses: expensive at scale, high cardinality, not good for aggregates or trends.
Use structured JSON logging — not string concatenation.
Include trace_id in every log line to correlate with distributed traces.
Set retention policies — 30 days hot, 90 days cold, archive for compliance.

Remember this

Logs answer "what happened?" — use structured JSON with trace IDs for searchable, correlatable events.

Metrics

Metrics are numeric measurements aggregated over time — counters (total requests), gauges (current memory usage), and histograms (latency distribution). Unlike logs, metrics are cheap to store and query at scale. A Prometheus server scraping /metrics every 15 seconds can track millions of time series across hundreds of services.
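To make the three metric types concrete, here is a minimal in-process sketch. The class names mirror the concepts above but are hypothetical; this is not the Prometheus client library, and a real service would normally use one rather than hand-rolling these.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing total, e.g. requests served."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can rise or fall, e.g. memory in use."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into upper-bounded buckets plus an overflow bucket."""
    bounds: list
    counts: list = field(default_factory=list)
    total: float = 0.0

    def __post_init__(self) -> None:
        self.bounds = sorted(self.bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the overflow bucket

    def observe(self, value: float) -> None:
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self) -> list:
        """Running totals per bucket, similar in spirit to cumulative `le` buckets."""
        running, out = 0, []
        for count in self.counts:
            running += count
            out.append(running)
        return out
```

For example, a `Histogram(bounds=[0.1, 0.2, 0.5])` that observes 0.05, 0.15, 0.18, and 0.7 seconds ends up with per-bucket counts `[1, 2, 0, 1]`. Storing bucket counts instead of raw values is what keeps metrics cheap: the cost depends on the number of buckets, not the number of requests.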

Define SLIs (Service Level Indicators) from metrics: availability (uptime %), latency (p99 response time), and error rate (5xx responses / total requests). Set SLOs (Service Level Objectives), such as "99.9% of requests complete in under 200 ms", and alert when the error budget burns too fast. The error budget is the fraction of requests allowed to miss the objective: a 99.9% SLO leaves a budget of 0.1%. The RED method (Rate, Errors, Duration) covers most HTTP services; the USE method (Utilization, Saturation, Errors) covers infrastructure.
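The arithmetic behind "alert when error budgets burn too fast" can be sketched as follows. The function names and the paging threshold of 10 are assumptions chosen for illustration; real alerting policies usually combine several windows and thresholds.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to miss the objective, e.g. about 0.001 for 99.9%."""
    return 1.0 - slo_target


def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """Observed failure ratio divided by the allowed failure ratio.

    A value of 1.0 means the budget lasts exactly the SLO period;
    10.0 means it would be used up in a tenth of that period.
    """
    if total_requests == 0:
        return 0.0
    observed_bad_ratio = bad_requests / total_requests
    return observed_bad_ratio / error_budget(slo_target)


def should_page(bad_requests: int, total_requests: int, slo_target: float,
                threshold: float = 10.0) -> bool:
    """Page only when the budget burns faster than the (illustrative) threshold."""
    return burn_rate(bad_requests, total_requests, slo_target) >= threshold
```

With a 99.9% target, a window of 10,000 requests in which 50 were slower than 200 ms has a burn rate of roughly 5: five times faster than sustainable, but below the example paging threshold.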

Metrics: counters, gauges, histograms over time

Quick reference

Best for: dashboards, alerting, capacity planning, SLO tracking.
Strengths: cheap storage, fast queries, trend detection, proactive alerting.
Weaknesses: no per-request detail, loses context, requires upfront instrumentation.
RED for services: Request rate, Error rate, Duration (latency).
USE for resources: Utilization, Saturation, Errors.
Alert on symptoms (high error rate), not causes (high CPU): a busy CPU is often harmless as long as requests still succeed on time.
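The RED numbers listed above can be derived from raw request records as in the sketch below. The `Request` record, the `red_summary` helper, and the nearest-rank percentile are assumptions for this example; a metrics backend would normally compute these from counters and histograms instead.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the request


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Rate, Errors, Duration for one service over one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p99_ms": percentile([r.duration_ms for r in requests], 99),
    }
```

Alerting on `error_ratio` or `p99_ms` from a summary like this targets what users actually experience, which is the point of alerting on symptoms.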

Remember this

Metrics answer "how much and how fast?" — instrument RED metrics on every service and alert on SLO breaches.

Distributed Traces

A distributed trace follows one request from entry to exit across all services. Each service creates spans — timed operations with metadata. A trace ID links spans across services; a parent span ID shows nesting. When checkout takes 2 seconds, the trace reveals that 1.8 seconds were spent in the payment service calling an external API.
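The checkout example can be modelled with a few lines of code. This is a sketch of the span data model only: the `Span` fields and helper names are hypothetical, the timings are invented to match the scenario above, and the self-time calculation assumes the children of a span do not overlap each other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_times(spans):
    """Time spent in each span itself, excluding time covered by its direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = child_total.get(span.parent_id, 0) + span.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def slowest_span(spans):
    """The span with the largest self time, i.e. the likely bottleneck."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# One trace: a 2-second checkout whose time is mostly spent in an external API call.
trace = [
    Span("a", None, "checkout", "POST /checkout", 0, 2000),
    Span("b", "a", "cart", "load cart", 10, 110),
    Span("c", "a", "payment", "charge card", 120, 1920),
    Span("d", "c", "payment", "call external API", 150, 1900),
]
bottleneck = slowest_span(trace)
```

Here `bottleneck` is the "call external API" span: the payment span lasts 1.8 seconds, but almost all of that is its child call, which is exactly the detail a trace view exposes.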

OpenTelemetry is the vendor-neutral standard for generating traces, metrics, and logs. Instrument your code once with OTel SDKs, export to any backend (Jaeger, Tempo, X-Ray, Datadog). In microservices, tracing is essential — without it, debugging cross-service latency is guesswork.

Trace: one request across multiple services

Quick reference

Best for: latency debugging, cross-service dependency mapping, bottleneck identification.
Strengths: end-to-end request visibility, shows exact slow span, maps service dependencies.
Weaknesses: sampling required at scale (trace 1-10% of requests), storage cost, instrumentation effort.
Use OpenTelemetry — one SDK for traces, metrics, and logs.
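Sample intelligently: always keep traces that contain errors and sample 1-5% of successful requests. Keeping every error requires deciding after the request finishes (tail-based sampling), because the outcome is not known when the trace starts.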
Propagate trace context via HTTP headers (traceparent) and message queue metadata.
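As an illustration of context propagation, the sketch below parses a `traceparent` header and builds the one to send on an outgoing call. It follows the version `00` layout of the W3C Trace Context format (`version-traceid-parentid-flags`); the function and type names are assumptions, and in practice an OpenTelemetry SDK handles this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the incoming context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets a backend stitch the spans back into one trace. The same string can be carried in message queue metadata.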

Remember this

Traces answer "where is it slow?" — use OpenTelemetry and always trace errors, sample the rest.
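One way to express "always trace errors, sample the rest" is a keep/drop decision made once the request outcome is known. The `keep_trace` function and its 5% default are assumptions for this sketch, not an OpenTelemetry API; in a real deployment this logic usually lives in the collection pipeline rather than in application code.

```python
def keep_trace(trace_id: str, had_error: bool, success_rate: float = 0.05) -> bool:
    """Decide whether to store a finished trace.

    Errors are always kept. Successful traces are kept for a fixed fraction
    of trace IDs, so every service reaches the same decision for the same trace.
    """
    if had_error:
        return True
    # Map the low 64 bits of the hex trace ID to a fraction between 0 and 1.
    bucket = int(trace_id[-16:], 16) / 2**64
    return bucket < success_rate
```

Deriving the decision from the trace ID instead of a random number avoids half-sampled traces in which some services kept their spans and others dropped them.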

OpenTelemetry in Production

OpenTelemetry (OTel) provides SDKs for every major language, auto-instrumentation for popular frameworks, and a collector that receives, processes, and exports telemetry to any backend. The typical pipeline: instrument app → OTel SDK → OTel Collector → Prometheus (metrics) + Jaeger (traces) + Loki (logs) → Grafana dashboards.

Start with auto-instrumentation for HTTP, database, and message queue calls: it covers most of the common cases with little or no code change. Add custom spans for business-critical operations (payment processing, order fulfillment). Correlate all three pillars: every log line includes trace_id, and every metric dashboard links to traces for anomalous periods.
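The "every log line includes trace_id" rule is easier to enforce when the ID is injected automatically rather than passed by hand. The sketch below does this with the standard library's `contextvars` and a `logging.Filter`; the `current_trace_id` variable and helper names are hypothetical stand-ins for whatever the tracing SDK in use exposes.

```python
import contextvars
import logging

# Trace ID of the request being handled in the current thread or async task.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_correlated_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(trace_id: str, logger: logging.Logger) -> None:
    """Set the trace ID for the duration of one request, then restore the old value."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("handling request")
    finally:
        current_trace_id.reset(token)
```

Because the ID lives in a context variable, code deep in the call stack can log without knowing anything about tracing, and concurrent requests do not see each other's IDs.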

Quick reference

Pipeline: App → OTel SDK → Collector → backends (Prometheus, Jaeger, Loki).
Auto-instrumentation: HTTP, gRPC, database drivers, Redis, Kafka — zero code for basics.
Custom spans: wrap business operations with tracer.startSpan() for domain visibility.
Grafana unifies metrics (Prometheus), traces (Tempo), and logs (Loki) in one UI.
Define dashboards per service: RED metrics + error trace links + recent error logs.
Run the OTel Collector as a sidecar or daemonset — not embedded in every app process.

Remember this

Standardize on OpenTelemetry — auto-instrument first, add custom spans for business logic, unify in Grafana.

Key takeaway

Observability is not optional for production systems. Logs catch the details, metrics catch the trends, and traces catch the latency. Together they can cut mean time to resolution from hours to minutes. Start with structured logging and RED metrics on every service, add OpenTelemetry tracing when you have more than two services, and build dashboards before you need them, not during an outage.
