High Availability & Fault Tolerance
High availability means the system keeps serving requests even when components fail. The target is expressed in nines: 99.9% uptime allows about 8.76 hours of downtime per year, and 99.99% allows about 52.6 minutes. Reaching higher nines requires eliminating every single point of failure, because the system can be no more available than any component that every request depends on.

Circuit breakers stop cascading failures. When a downstream service starts timing out, the circuit breaker opens and returns an error or fallback immediately, instead of letting callers wait and hold threads. After a cooldown it lets a trial request through (the half-open state) and closes again if that request succeeds.

Retries with exponential backoff and jitter absorb transient failures while spreading retry attempts over time, so clients do not hit a recovering service in synchronized waves (the thundering herd). Together these patterns isolate component failures so they stay local rather than cascading into system-wide outages.
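
The circuit breaker and the retry with backoff and jitter described above can be sketched in a few lines of Python. This is a minimal single-threaded illustration, not a production implementation: the names CircuitBreaker, CircuitOpenError, backoff_delay and retry, and every default value (failure threshold, cooldown, base delay, cap), are assumptions made for the example. The jitter shown is the "full jitter" variant, which is one of several reasonable choices.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    # Hypothetical class; thresholds and cooldown are illustrative defaults.
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Open: fail fast instead of waiting on a struggling service.
                raise CircuitOpenError("circuit open")
            # Cooldown elapsed: fall through and let one probe run (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter delay: rng() scaled by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(func, max_attempts=4, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying cannot help while the breaker is open
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

The two pieces compose: a caller might write retry(lambda: breaker.call(fetch_profile)), where fetch_profile stands in for any remote call. Transient errors are retried with growing, randomized delays, and once the breaker opens the retry loop stops immediately rather than adding load to a service that is already failing.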