System Design Fundamentals

Lesson 11 of 12 · 18 min

Observability: Logs, Metrics & Traces

Observability is the ability to understand what a system is doing internally from its external outputs. The three pillars are logs (what happened), metrics (how much and how fast), and distributed traces (where time was spent across services). A system with all three lets you move from alert to root cause without guessing. Logs should be structured, typically as JSON, so that they are machine-readable and filterable by field. Metrics are built from counters, gauges, and histograms; request rate, error rate, and latency percentiles (p50, p95, p99) are the minimum set. Distributed tracing assigns a trace ID to every request and propagates it across service boundaries. Each service records a span, and the trace view shows the full call tree with timing. OpenTelemetry is the vendor-neutral standard for emitting all three: instrument once, then export to any compatible backend.
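To make trace propagation concrete, here is a minimal sketch of how a trace ID might travel between two services. It assumes the W3C Trace Context `traceparent` header format (`00-<trace id>-<span id>-<flags>`). The services (`checkoutService`, `paymentsService`), the `withSpan` helper, and the in-memory `recordedSpans` store are hypothetical names for illustration, not any library's API. In practice an OpenTelemetry SDK would handle this for you.

```javascript
// Hypothetical in-memory span store standing in for a tracing backend.
const recordedSpans = [];

// Random lowercase hex string. Real tracers use a stronger random source.
function randomHex(byteCount) {
  let out = '';
  for (let i = 0; i < byteCount; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, '0');
  }
  return out;
}

// Parse a W3C traceparent header: version-traceId-parentSpanId-flags.
function parseTraceparent(header) {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header || '');
  return match ? { traceId: match[1], parentSpanId: match[2] } : null;
}

// Run `work` inside a span. Reuses the incoming trace ID when one is present,
// otherwise starts a new trace. `work` receives the headers to send downstream.
function withSpan(serviceName, incomingHeaders, work) {
  const parent = parseTraceparent(incomingHeaders.traceparent);
  const span = {
    service: serviceName,
    traceId: parent ? parent.traceId : randomHex(16),
    spanId: randomHex(8),
    parentSpanId: parent ? parent.parentSpanId : null,
    startMs: Date.now(),
  };
  const outgoingHeaders = { traceparent: `00-${span.traceId}-${span.spanId}-01` };
  try {
    return work(outgoingHeaders);
  } finally {
    // Record the span even if the work throws.
    span.durationMs = Date.now() - span.startMs;
    recordedSpans.push(span);
  }
}

// Two hypothetical services: checkout calls payments and forwards the header.
function paymentsService(headers) {
  return withSpan('payments', headers, () => 'charged');
}
function checkoutService(headers) {
  return withSpan('checkout', headers, (outgoing) => paymentsService(outgoing));
}

const result = checkoutService({});
console.log(result, recordedSpans);
```

Both recorded spans share one `traceId`, and the payments span's `parentSpanId` equals the checkout span's `spanId`. That parent link is what lets a trace viewer rebuild the call tree.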

Before
Unstructured logs — impossible to query
// Cannot filter, aggregate, or alert on this
console.log("User 123 placed order 456 for $99.00");
console.log("Error processing payment for order 456");

// Try answering: "orders over $50 that failed last hour"
// → impossible without parsing free-form strings
After
Structured logs + metrics + trace context
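// Assumes logger, span, orderCounter, orderLatency, and startTime are
// created elsewhere (a JSON logger, the active trace span, a counter,
// and a latency histogram).

// Structured log: filterable by any field
logger.info({
  event: 'order.placed',
  userId: '123',
  orderId: '456',
  amount: 99.00,
  traceId: span.traceId,   // links log to trace
});

// Metric: alertable, graphable
orderCounter.inc({ status: 'success' });
orderLatency.observe(Date.now() - startTime); // elapsed milliseconds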
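With structured records, the question from the Before example ("orders over $50 that failed last hour") becomes a simple field filter. The sketch below is illustrative only: the `ts` timestamp field, the `order.failed` event name, and the sample records are assumptions that do not appear in the lesson. A real log backend would run the equivalent query server-side.

```javascript
// Hypothetical structured log lines, one JSON object per line.
// The `ts` field and the 'order.failed' event name are assumed for illustration.
const now = Date.parse('2024-01-01T12:00:00Z');
const logLines = [
  JSON.stringify({ ts: '2024-01-01T11:50:00Z', event: 'order.failed', orderId: '456', amount: 99.0 }),
  JSON.stringify({ ts: '2024-01-01T11:55:00Z', event: 'order.failed', orderId: '457', amount: 20.0 }),
  JSON.stringify({ ts: '2024-01-01T11:58:00Z', event: 'order.placed', orderId: '458', amount: 75.0 }),
  JSON.stringify({ ts: '2024-01-01T09:00:00Z', event: 'order.failed', orderId: '459', amount: 150.0 }),
];

const ONE_HOUR_MS = 60 * 60 * 1000;

// Return IDs of failed orders above `minAmount` logged within the last hour.
function failedOrdersOver(lines, minAmount, nowMs) {
  return lines
    .map((line) => JSON.parse(line))
    .filter((rec) => rec.event === 'order.failed')
    .filter((rec) => rec.amount > minAmount)
    .filter((rec) => nowMs - Date.parse(rec.ts) <= ONE_HOUR_MS)
    .map((rec) => rec.orderId);
}

console.log(failedOrdersOver(logLines, 50, now)); // [ '456' ]
```

Each condition in the question maps to one filter on a named field, which is exactly what the free-form strings in the Before example cannot offer.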

Key Takeaway

Structured logs, p99 latency metrics, and distributed traces are not optional in production — they are how you find problems before customers do.
