Rate Limiting: Token Bucket vs Sliding Window

July 4, 2026 · 9 min read

Every public API needs rate limiting — to prevent abuse, protect downstream services, and ensure fair usage across tenants. Without it, one aggressive client can overwhelm your database while others get timeouts. The two most common algorithms are the token bucket and the sliding window log — each with different burst tolerance and precision characteristics.

A fixed window counter is simple but lets through up to twice the limit when a burst straddles a window boundary. A leaky bucket smooths traffic into a steady outflow but adds queuing latency. Token buckets allow controlled bursts while maintaining an average rate. Sliding windows count requests in a rolling interval for precise limits. This article compares the algorithms, shows how to implement them on top of Redis, and explains where API gateways like Kong and AWS API Gateway apply each.
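To make the boundary problem concrete, here is a minimal in-memory sketch of a fixed window counter. The class name `FixedWindowCounter` and its parameters are illustrative assumptions, not taken from any library; timestamps are passed in explicitly so the behaviour is reproducible.

```python
class FixedWindowCounter:
    """Allow at most `limit` requests per aligned window of `window_s` seconds."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.counts = {}  # window index -> request count (never pruned in this sketch)

    def allow(self, now):
        window = int(now // self.window_s)  # index of the aligned window `now` falls in
        if self.counts.get(window, 0) >= self.limit:
            return False
        self.counts[window] = self.counts.get(window, 0) + 1
        return True


limiter = FixedWindowCounter(limit=100, window_s=60)
# 100 requests just before the boundary, 100 just after it.
late = sum(limiter.allow(59.5) for _ in range(100))
early = sum(limiter.allow(60.5) for _ in range(100))
print(late + early)  # 200 accepted within about one second
```

Both bursts fall into different windows, so a nominal 100/min limit admits 200 requests in roughly one second.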

[Figure: Token bucket vs sliding window rate limiting. Token bucket: burst-friendly, tokens refill over time, used by AWS and Stripe. Sliding window: precise per time window, no burst above the limit, commonly built on Redis sorted sets.]

Token Bucket

The token bucket algorithm maintains a bucket with a maximum capacity (burst limit) and a refill rate (sustained limit). Each incoming request consumes one token. If tokens are available, the request proceeds; if the bucket is empty, the request is rejected with HTTP 429 Too Many Requests. Tokens refill continuously at the configured rate — e.g. 100 tokens per minute with a bucket size of 20 allows 20 immediate requests, then throttles to ~1.67 per second.
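The refill-and-consume logic can be sketched in a few lines. This is a single-process, in-memory illustration under stated assumptions: the class name `TokenBucket`, the `retry_after` helper, and the injectable `clock` are hypothetical choices made for the example, and the numbers mirror the 100 tokens per minute, bucket size 20 case above.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket; `clock` is injectable so behaviour can be tested."""

    def __init__(self, capacity: float, refill_per_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock
        self.tokens = capacity        # start full so an initial burst is allowed
        self.last_refill = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last_refill
        # Add tokens for the elapsed time, never exceeding the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last_refill = now

    def allow(self, cost: float = 1.0) -> bool:
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens are available (0.0 if available now)."""
        self._refill()
        return max(0.0, (cost - self.tokens) / self.refill_per_s)


fake_now = [0.0]  # fake clock so the example is deterministic
bucket = TokenBucket(capacity=20, refill_per_s=100 / 60, clock=lambda: fake_now[0])
burst = sum(bucket.allow() for _ in range(25))       # 20 allowed, 5 rejected
fake_now[0] += 3.0                                    # 3 s later: 5 tokens refilled
after_wait = sum(bucket.allow() for _ in range(25))
print(burst, after_wait)  # 20 5
```

Computing the refill lazily on each request avoids a background timer; the value from `retry_after` could feed a Retry-After header on a 429 response.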

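This burst-friendly behavior makes token buckets a good fit for APIs where occasional spikes are acceptable, like a dashboard loading ten endpoints at once. AWS API Gateway documents a token bucket for its throttling, and Stripe has described using one for its request rate limiter. Implementation is simple: store {tokens, lastRefillTime} in Redis and compute the refilled tokens on each request, doing the read, refill, and decrement atomically (for example in a Lua script) so concurrent requests cannot spend the same token.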

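[Figure: Token bucket: refill rate + burst capacity. A bucket holding 5 of 10 tokens refills at 1 token/sec. Each arriving request takes 1 token and is allowed, or receives a 429 if none remain. A burst of 10 requests at once is fine; after that, traffic is throttled to the refill rate.]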

Quick reference

  • Best for: APIs with acceptable burst traffic, user-facing endpoints, payment APIs.
  • Strengths: allows controlled bursts, smooth average rate, simple Redis implementation.
  • Weaknesses: burst can still overwhelm downstream if bucket size is too large.
  • Parameters: bucket size (max burst) + refill rate (sustained requests per second).
  • Return Retry-After header on 429 so clients know when to retry.
  • Use per-user and per-IP limits — a user on a shared IP should not block others.

Remember this

Token buckets are the default for most APIs — they balance burst tolerance with sustained rate control.

Sliding Window

The sliding window algorithm counts requests within a rolling time window. For a limit of 100 requests per minute, it counts all requests in the last 60 seconds, not just the current clock minute. When a new request arrives, remove timestamps older than 60 seconds; if 100 timestamps remain, reject the request, otherwise record its timestamp and allow it.
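A minimal in-memory sketch of this log-based approach follows. The class name `SlidingWindowLog` is an assumption for illustration, and timestamps are passed in explicitly so the example replays a burst that straddles a minute boundary.

```python
from collections import deque


class SlidingWindowLog:
    """Exact sliding window: keeps one timestamp per accepted request."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.timestamps = deque()  # oldest accepted request on the left

    def allow(self, now):
        cutoff = now - self.window_s
        # Evict timestamps that have slid out of the rolling window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False               # window already holds `limit` requests
        self.timestamps.append(now)    # only accepted requests are recorded
        return True


limiter = SlidingWindowLog(limit=100, window_s=60)
late = sum(limiter.allow(59.5) for _ in range(100))    # fills the window
early = sum(limiter.allow(60.5) for _ in range(100))   # one second later: all rejected
later = limiter.allow(119.6)                           # the 59.5 entries have expired
print(late, early, later)  # 100 0 True
```

Recording only accepted requests is a design choice in this sketch; counting rejected attempts as well would penalize clients that keep hammering the endpoint.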

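This eliminates the boundary spike problem of fixed windows (where 100 requests at 00:59 and 100 at 01:00 both pass a 100/min limit, even though 200 requests arrived within about a second). The cost is memory: you must store a timestamp for each request in the window. Redis sorted sets (ZADD/ZCOUNT) or a circular buffer implement this efficiently. Note that NGINX's limit_req module is documented as a leaky bucket, and the redis-cell module implements GCRA (generic cell rate algorithm), so neither is a sliding window log; both smooth traffic to a steady rate with a configurable burst.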
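One way to build the sorted-set variant is sketched below. It assumes `client` is a redis-py `redis.Redis` instance (or anything with the same `pipeline`, `zremrangebyscore`, `zadd`, `zcard`, `expire`, and `zrem` methods); the function name `allow_request` and the key layout are hypothetical.

```python
import time
import uuid


def allow_request(client, key, limit, window_s, now=None):
    """Sliding window log on a Redis sorted set (score = request timestamp).

    `client` is assumed to be a redis-py `redis.Redis` instance or compatible.
    """
    now = time.time() if now is None else now
    member = f"{now}:{uuid.uuid4().hex}"       # unique, so equal timestamps do not collide
    pipe = client.pipeline(transaction=True)    # MULTI/EXEC: the commands run as one unit
    pipe.zremrangebyscore(key, 0, now - window_s)   # drop entries that left the window
    pipe.zadd(key, {member: now})                   # tentatively record this request
    pipe.zcard(key)                                 # count entries now in the window
    pipe.expire(key, int(window_s) + 1)             # let idle keys expire on their own
    _, _, count, _ = pipe.execute()
    if count > limit:
        client.zrem(key, member)                    # over the limit: undo our own entry
        return False
    return True
```

Adding the entry before counting keeps the check inside a single transaction; the trade-off is an extra ZREM on rejection. A Lua script that counts first and adds only on success is an alternative.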

[Figure: Sliding window: count requests in rolling interval. With a 60-second window, count the requests in the last N seconds and reject if over the limit.]

Quick reference

  • Best for: strict rate limits, anti-abuse, login endpoints, SMS/email sending.
  • Strengths: precise limits, no boundary spikes, fair across time.
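  • Weaknesses: higher memory usage (stores one timestamp per request), no allowance for bursts above the limit.
  • Redis implementation: ZREMRANGEBYSCORE key 0 (now - window); ZCARD key; if the count is below the limit, ZADD key now unique-id. Run the steps atomically (MULTI/EXEC or a Lua script), and use a unique member, because two requests that use the same timestamp as the member overwrite each other.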
  • Sliding window counter (approximate) reduces memory by dividing the window into sub-buckets.
  • Combine with exponential backoff on the client to avoid retry storms.
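The sub-bucket approximation mentioned in the list can be sketched as follows. The class name `SlidingWindowCounter` and the choice of six 10-second sub-buckets are assumptions for illustration; the sketch stores one integer per sub-bucket instead of one timestamp per request.

```python
class SlidingWindowCounter:
    """Approximate sliding window: per-sub-bucket counts instead of timestamps."""

    def __init__(self, limit, window_s, buckets):
        self.limit = limit
        self.buckets = buckets
        self.bucket_s = window_s / buckets
        self.counts = {}  # sub-bucket index -> accepted requests in that sub-bucket

    def allow(self, now):
        current = int(now // self.bucket_s)
        oldest = current - self.buckets + 1
        # Forget sub-buckets that have slid entirely out of the window.
        for idx in [i for i in self.counts if i < oldest]:
            del self.counts[idx]
        if sum(self.counts.values()) >= self.limit:
            return False
        self.counts[current] = self.counts.get(current, 0) + 1
        return True


limiter = SlidingWindowCounter(limit=100, window_s=60, buckets=6)
late = sum(limiter.allow(59.5) for _ in range(100))    # lands in sub-bucket 5
early = sum(limiter.allow(60.5) for _ in range(100))   # sub-bucket 5 still counted
approx = limiter.allow(110.0)                          # sub-bucket 5 dropped at 50.5 s old
print(late, early, approx)  # 100 0 True
```

The last call shows the approximation: requests made at 59.5 s are forgotten once their whole sub-bucket expires, up to one sub-bucket (10 s here) earlier than an exact log would release them. More sub-buckets tighten the error at the cost of more counters.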

Remember this

Sliding windows enforce precise limits — use them where burst tolerance is a security risk, not a feature.

Implementing Rate Limits in Production

Rate limiting belongs at the edge — API Gateway, NGINX, or a dedicated middleware — not deep inside business logic. Centralize limits in Redis so all app instances share the same counters. Return standard HTTP 429 with Retry-After and X-RateLimit-Remaining headers so clients can adapt.

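Tier limits by user plan: free tier gets 100 req/min, pro gets 1000. Use different limits per endpoint: login gets 5/min (anti-brute-force), read endpoints get 1000/min. Log rate limit hits to detect abuse patterns. For multi-region deployments, weigh a single global limiter (for example, one shared Redis deployment such as Redis Cluster), which enforces one exact quota but adds cross-region latency to every check, against per-region limits, which are fast but let a client spend its quota once in each region.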
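A limit lookup along these lines might look like the sketch below. The numbers come from the paragraph above; the names `Limit`, `PLAN_LIMITS`, `ENDPOINT_LIMITS`, `limit_for`, and `counter_key`, and the rule that an endpoint-specific limit overrides the plan limit, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    requests: int
    per_seconds: int


# Plan-wide defaults (values from the text above).
PLAN_LIMITS = {"free": Limit(100, 60), "pro": Limit(1000, 60)}
# Stricter per-endpoint overrides, e.g. anti-brute-force on login.
ENDPOINT_LIMITS = {"/login": Limit(5, 60)}


def limit_for(plan, endpoint):
    """Endpoint overrides win; unknown plans fall back to the free tier (assumed policy)."""
    if endpoint in ENDPOINT_LIMITS:
        return ENDPOINT_LIMITS[endpoint]
    return PLAN_LIMITS.get(plan, PLAN_LIMITS["free"])


def counter_key(user_id, endpoint):
    """Key for the shared counter, so limits are tracked per user and per endpoint."""
    return f"ratelimit:{user_id}:{endpoint}"


print(limit_for("pro", "/login"))    # Limit(requests=5, per_seconds=60)
print(limit_for("pro", "/orders"))   # Limit(requests=1000, per_seconds=60)
print(counter_key("u42", "/login"))  # ratelimit:u42:/login
```

Keeping the table as data rather than scattered conditionals makes it easier to audit limits and to load them from configuration later.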

Quick reference

  • Place rate limiting at API Gateway or reverse proxy — before requests hit application code.
  • Use Redis for shared state across multiple app instances.
  • Return headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After.
  • Different limits per endpoint: strict on /login, generous on /health.
  • Tier by plan: free, pro, enterprise with different bucket sizes and refill rates.
  • Monitor 429 rates — a spike may indicate a bug, abuse, or need to raise limits.
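The headers in the list above can be assembled with a small helper. This is a framework-agnostic sketch: the function name `rate_limit_response` is hypothetical, and it assumes X-RateLimit-Reset carries Unix epoch seconds, which is one common convention (the X-RateLimit-* headers are de facto rather than standardized, so services differ).

```python
import math


def rate_limit_response(allowed, limit, remaining, reset_epoch_s, now_epoch_s):
    """Return (status_code, headers) describing the caller's rate limit state."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch_s)),  # assumed: Unix epoch seconds
    }
    if allowed:
        return 200, headers
    # Retry-After is whole seconds to wait; round up so clients do not retry early.
    headers["Retry-After"] = str(max(1, math.ceil(reset_epoch_s - now_epoch_s)))
    return 429, headers


status, headers = rate_limit_response(
    allowed=False, limit=100, remaining=0, reset_epoch_s=1060, now_epoch_s=1000.5
)
print(status, headers["Retry-After"])  # 429 60
```

Sending the limit headers on successful responses too lets well-behaved clients slow down before they hit a 429.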

Remember this

Centralize rate limiting at the edge with Redis-backed counters and clear HTTP 429 responses.

Key takeaway

Token buckets and sliding windows solve different problems. Use token buckets when controlled bursts are acceptable and you want simple implementation. Use sliding windows when precise, strict limits matter — login endpoints, SMS gateways, and anti-abuse scenarios. Most production APIs combine both: token bucket for general API traffic and sliding window for sensitive endpoints. The algorithm matters less than consistent enforcement, clear error responses, and monitoring.
