High Availability & Fault Tolerance

High availability means the system keeps serving requests even when components fail. The target is expressed in nines: 99.9% uptime allows ~8.8 hours of downtime per year; 99.99% allows ~53 minutes. Reaching higher nines requires eliminating every single point of failure: requests that depend on a non-redundant component can be no more available than that component, so each one caps overall availability.

Circuit breakers stop cascading failures. When a downstream service starts timing out or erroring, the breaker opens and returns an error or fallback immediately instead of waiting and tying up threads or connections. After a cooldown it lets a probe request through (the half-open state) and closes again if the service has recovered.

Retries with exponential backoff and jitter absorb transient failures without triggering a thundering herd: backoff spaces out successive attempts, and jitter randomizes the delays so clients do not all retry at the same instant. Together these patterns keep component failures local rather than letting them cascade into system-wide outages.
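The downtime figures and the "ceiling" effect can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, a year is taken as 365 days, and the redundancy formula assumes replicas fail independently, which real systems only approximate.

```typescript
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600, ignoring leap years

// Downtime budget in minutes per year for an availability given as a fraction (0..1).
function downtimeMinutesPerYear(availability: number): number {
  return (1 - availability) * MINUTES_PER_YEAR;
}

// Components that must ALL be up for a request to succeed: availabilities multiply,
// so the result is never higher than the weakest component.
function serialAvailability(components: number[]): number {
  return components.reduce((product, a) => product * a, 1);
}

// N interchangeable replicas where ANY one is enough.
// Assumption: replicas fail independently of each other.
function redundantAvailability(single: number, replicas: number): number {
  return 1 - Math.pow(1 - single, replicas);
}

console.log(downtimeMinutesPerYear(0.999) / 60); // about 8.76 hours
console.log(downtimeMinutesPerYear(0.9999));     // about 52.6 minutes
// Two 99.99% services in front of one 99.9% dependency: below 99.9% overall.
console.log(serialAvailability([0.9999, 0.9999, 0.999]));
// Two independent 99.9% replicas of that dependency: about 99.9999%.
console.log(redundantAvailability(0.999, 2));
```

The serial case shows why a single non-redundant dependency limits the whole request path, and the redundant case shows why removing it raises the limit.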

Before

No resilience — one slow service takes down the API

async function getRecommendations(userId: string) {
  // No timeout: if the recommendation service hangs for 30s,
  // this request stays open for 30s.
  // Under load, pending requests pile up until workers,
  // sockets, or memory are exhausted → entire API down.
  return recommendationService.get(userId);
}

After

Circuit breaker with fallback

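// Assumption: the opossum library, whose constructor and option names
// match the ones used here. Adjust the import for your module settings.
import CircuitBreaker from 'opossum';

// Wrap the call in an arrow function so `get` keeps its `this` binding.
const breaker = new CircuitBreaker(
  (userId: string) => recommendationService.get(userId),
  {
    timeout: 3000,                 // fail fast after 3s
    errorThresholdPercentage: 50,  // open once 50% of recent calls fail
    resetTimeout: 10000,           // probe again after 10s
  },
);

async function getRecommendations(userId: string) {
  try {
    return await breaker.fire(userId);
  } catch {
    // Fallback: precomputed popular items served from the cache.
    // The cache can fail too, so treat this as best effort.
    return redis.get('popular:recommendations');
  }
}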
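The circuit breaker covers sustained failures. The retries with exponential backoff and jitter described earlier cover brief, transient ones. A minimal sketch of that pattern follows, using the "full jitter" variant (a random delay between zero and an exponentially growing cap). The names retryWithBackoff and RetryOptions are hypothetical, not from any library, and the sleep and random hooks exist only to make the helper testable.

```typescript
interface RetryOptions {
  maxAttempts: number;  // total attempts, including the first call
  baseDelayMs: number;  // delay cap before the first retry
  maxDelayMs: number;   // upper bound on the exponentially growing cap
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number;                 // injectable for tests, returns [0, 1)
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { maxAttempts, baseDelayMs, maxDelayMs } = options;
  const sleep = options.sleep ?? defaultSleep;
  const random = options.random ?? Math.random;

  let lastError: unknown = new Error('maxAttempts must be at least 1');
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxAttempts - 1) break; // out of attempts, give up

      // Exponential backoff: the cap doubles each attempt, up to maxDelayMs.
      const cap = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
      // Full jitter: wait a random time in [0, cap) so clients spread out.
      await sleep(random() * cap);
    }
  }
  throw lastError;
}
```

A call such as retryWithBackoff(() => recommendationService.get(userId), { maxAttempts: 3, baseDelayMs: 100, maxDelayMs: 2000 }) would retry up to twice. Retrying is only safe for idempotent operations, and the total retry time should stay within whatever timeout the caller enforces.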

Key Takeaway

Eliminate single points of failure, add circuit breakers to prevent cascades, and always define a fallback: failures are a matter of when, not if.