Engineering · 8 min read · December 28, 2025

Rate Limiting Algorithms: Token Bucket, Sliding Window, and More

How rate limiting algorithms work and when to use each one. Token bucket, sliding window, fixed window, and leaky bucket explained with practical implementation guidance.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

Why Rate Limiting Is a Design Problem, Not Just a Security Feature

Rate limiting is usually introduced as a security measure — protect your API from abuse, prevent DDoS attacks, stop malicious bots. These are valid motivations, but they undersell the concept. Rate limiting is fundamentally about resource management: ensuring that your system provides consistent service to all users by preventing any single user or pattern of usage from consuming disproportionate resources.

Without rate limiting, a single customer's automated script can degrade the experience for every other customer. A misconfigured integration partner can send ten thousand requests per second and effectively take your API offline. A legitimate traffic spike can overwhelm your database connection pool and cascade into failures across your entire system.

The algorithm you choose for rate limiting determines the behavior characteristics of your system under load — how smooth the rate enforcement is, how it handles bursts, and how fairly it distributes capacity across users. Each algorithm makes different trade-offs, and understanding those trade-offs is essential for choosing the right one.


The Algorithms Compared

Fixed window counting is the simplest approach. Divide time into fixed intervals (e.g., one-minute windows), count requests within each window, and reject requests that exceed the limit. Implementation is straightforward: maintain a counter per user per window in Redis or a similar store, increment on each request, and compare against the threshold.

The weakness is the boundary problem. A user who sends 100 requests in the last second of one window and 100 requests in the first second of the next window has sent 200 requests in two seconds while staying within a 100-per-minute limit in both windows. At the boundary, the effective rate can be double your intended limit.

Sliding window log solves the boundary problem by tracking the timestamp of every request. When a new request arrives, count the number of timestamps within the past window duration (e.g., 60 seconds) and compare against the limit. This provides exact enforcement but requires storing every timestamp, which can be memory-intensive for high-volume APIs.

Sliding window counter is a practical compromise. It combines the current window's count with a weighted portion of the previous window's count based on how far into the current window you are. If you're 30 seconds into a 60-second window, the effective count is (current window count) + (previous window count × 0.5). This approximation is close enough for most applications and uses only two counters per user instead of a log of timestamps.

Token bucket is the most flexible algorithm and the one I reach for most often. Each user has a bucket that holds a maximum number of tokens (the burst limit). Tokens are added at a steady rate (the sustained rate). Each request consumes one token. If the bucket is empty, the request is rejected or queued.

The elegance of token bucket is that it naturally handles both sustained rates and bursts. A user with a bucket capacity of 20 and a refill rate of 10 per second can burst to 20 requests immediately, then sustain 10 per second thereafter. This matches how most real API usage looks — occasional bursts of activity within an overall rate constraint. The implementation requires only two values per user: the current token count and the timestamp of the last refill calculation.

Leaky bucket processes requests at a fixed rate, like water leaking from a bucket at a constant drip. Requests are queued and processed in order. If the queue is full, new requests are rejected. This produces the smoothest output rate — perfectly uniform — but adds latency because requests wait in the queue. It's appropriate for scenarios where downstream systems require a strictly constant rate of incoming work.
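A minimal sketch of the queue-and-drain behavior (simulated time passed in explicitly; in practice the drain would run on a timer or worker loop):

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: queue incoming requests, release them at a fixed rate."""

    def __init__(self, queue_size: int, leak_rate: float):
        self.queue_size = queue_size
        self.leak_rate = leak_rate   # requests released per second
        self.queue: deque = deque()
        self.last_leak = 0.0

    def offer(self, request, now: float) -> bool:
        self.drain(now)
        if len(self.queue) >= self.queue_size:
            return False             # bucket overflowed: reject
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        """Release the requests that have 'leaked' since the last drain."""
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked <= 0:
            return []
        # Advance by whole leaked units so fractional time isn't lost.
        self.last_leak += leaked / self.leak_rate
        return [self.queue.popleft() for _ in range(min(leaked, len(self.queue)))]
```

The queue is where the added latency comes from: a request admitted behind others waits its turn even though it was never rejected.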


Implementation Decisions

Where to enforce. Rate limiting can happen at the API gateway level, the application level, or both. Gateway-level limiting protects against volume attacks before requests reach your application code. Application-level limiting allows more granular rules — different limits per endpoint, per user tier, or per operation type. For most systems, implement broad protection at the gateway and fine-grained rules in the application.

What to limit by. User ID, API key, IP address, or a combination. User-based limiting is the most fair but requires authentication before the limit check. IP-based limiting works for unauthenticated endpoints but punishes users behind shared IPs (corporate networks, VPNs). API key limiting works well for machine-to-machine APIs. Many systems use IP-based limiting for unauthenticated endpoints and user-based limiting for authenticated ones.

How to communicate limits. Include rate limit information in response headers: X-RateLimit-Limit (the maximum), X-RateLimit-Remaining (how many requests are left), and X-RateLimit-Reset (when the limit resets). When a request is rate-limited, return a 429 status code with a Retry-After header. This lets well-behaved clients adjust their request patterns proactively rather than hammering your API until they're allowed through.

Distributed rate limiting is necessary when your API runs on multiple servers. Each server can't maintain its own independent counters because users would get N times the intended limit by rotating across N servers. Centralized counters in Redis are the standard solution, using atomic increment operations (INCR with EXPIRE for fixed windows, or Lua scripts for token bucket) to ensure consistency. The trade-off is an additional network round-trip per request to check the counter, but Redis latency is typically under a millisecond, making this negligible.


Common Pitfalls

Avoid rate limits that punish legitimate usage patterns. If your API serves a dashboard that loads five resources simultaneously, a rate limit of five requests per second means the dashboard fails on load. Understand your clients' actual usage patterns before setting limits. Overly aggressive limits create more support burden than they prevent.

Don't forget to rate-limit internal services. Microservices that call each other without rate limits can create cascading failures when one service slows down and another retries aggressively. Internal rate limiting — or circuit breakers, which are complementary — prevents one struggling service from taking down the entire system.

Test your rate limiting under realistic conditions. A rate limiter that works correctly at 100 requests per second might behave differently at 10,000 requests per second due to Redis contention, clock skew between servers, or counter overflow. Load test the rate limiting infrastructure itself, not just the application behind it.

Rate limiting is one component of a broader API resilience strategy. Combined with proper error handling and thoughtful API design, it ensures that your system degrades gracefully under pressure rather than failing catastrophically.