DevOps · 7 min read · July 19, 2025

Scaling SaaS Infrastructure: From 100 to 10,000 Users

Scaling a SaaS product isn't about adding servers. It's about identifying bottlenecks before they become outages and addressing them in the right order.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

Scaling Problems Are Sequential, Not Parallel

The first instinct when a SaaS product starts growing is to think about scale in abstract terms — horizontal scaling, microservices, distributed systems. This is premature optimization at its worst. Scaling problems reveal themselves sequentially, and the bottleneck at 500 users is almost never the same bottleneck at 5,000 users.

The productive approach is to measure, identify the current bottleneck, fix it, and repeat. Most SaaS products can serve their first several thousand users on surprisingly modest infrastructure if the application code is reasonably well-written. The problems that emerge at this scale are rarely about compute capacity — they're about database queries, caching, and background job processing.

I've scaled several SaaS applications through this growth phase, and the pattern is remarkably consistent. Here's what breaks and when.


The Database Is Always First

The first bottleneck in every SaaS product I've worked on has been the database. Not because the database can't handle the load, but because the queries were written for development data volumes and tested against a database with a few hundred rows.

Indexing is the single highest-leverage improvement. A query that does a full table scan on 10,000 rows returns in 5ms. The same query on 500,000 rows takes 3 seconds. Adding the right index drops it back to 5ms. Reviewing your database indexing strategy when you have real production query patterns is far more productive than guessing at indexes during development.
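To make this concrete, here is a small self-contained sketch using SQLite (the schema and index name are illustrative, not from any particular application). The query plan shows a full table scan before the index exists and an index search after:

```python
import sqlite3

# Hypothetical schema for illustration: a users table looked up by email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10_000)],
)

def query_plan(sql, params=()):
    # EXPLAIN QUERY PLAN rows carry the plan description in column 3.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params))

lookup = "SELECT id FROM users WHERE email = ?"
print(query_plan(lookup, ("user42@example.com",)))  # SCAN users (full table scan)

conn.execute("CREATE INDEX idx_users_email ON users (email)")
print(query_plan(lookup, ("user42@example.com",)))  # SEARCH ... USING ... INDEX
```

The same principle applies to Postgres or MySQL via `EXPLAIN ANALYZE`; the point is to check plans against production-sized data, not development fixtures.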

N+1 queries are the second most common database performance problem. Loading a list of 50 items, each with an associated user, shouldn't require 51 queries. Use eager loading or dataloaders to batch these into two queries. The fix is usually straightforward once you identify the problem — the challenge is identifying it in the first place, which requires query logging and analysis.
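A minimal sketch of the eager-loading fix, again with an illustrative SQLite schema: instead of one user lookup per item, the distinct `user_id` values are collected and fetched with a single `IN` query, for two queries total:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"User {i}") for i in range(10)])
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [(n, n % 10, f"Item {n}") for n in range(50)])

# Query 1: load the list of items.
items = conn.execute("SELECT id, user_id, title FROM items").fetchall()

# Query 2: batch-load every referenced user in one IN query,
# instead of issuing 50 separate per-item lookups.
user_ids = sorted({row[1] for row in items})
placeholders = ",".join("?" for _ in user_ids)
users = dict(conn.execute(
    f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
).fetchall())

hydrated = [{"id": i, "title": t, "user": users[u]} for i, u, t in items]
print(len(hydrated))  # 50 items loaded with 2 queries instead of 51
```

Most ORMs express this as an option (e.g. eager-loading associations) rather than hand-written SQL, but the query count is what matters.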

Connection pooling becomes critical as your application scales horizontally. Each application instance maintains its own connection pool, and the total connections across all instances can quickly exceed the database's connection limit. A connection pooler like PgBouncer sits between your application and the database, multiplexing hundreds of application connections onto a smaller number of database connections.
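The multiplexing idea can be sketched in a few lines. This is not PgBouncer itself, just an in-memory illustration of the mechanism: a fixed set of backend connections shared by many concurrent clients, each blocking until a connection frees up:

```python
import queue
import threading

# Illustrative sketch of what a pooler like PgBouncer does: many client
# requests share a small, fixed set of real backend connections.
class ConnectionPool:
    def __init__(self, make_conn, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(make_conn())

    def run(self, fn):
        conn = self._pool.get()      # block until a backend connection is free
        try:
            return fn(conn)
        finally:
            self._pool.put(conn)     # return it for the next client

# Stand-in for opening a real database connection (hypothetical).
made = []
pool = ConnectionPool(lambda: made.append(1) or len(made), size=5)

results = []
def client(i):
    results.append(pool.run(lambda conn: (i, conn)))

threads = [threading.Thread(target=client, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(made), len(results))  # 5 backend connections served 100 clients
```

In production you would configure this in PgBouncer (or your driver's built-in pool) rather than write it yourself; the sketch only shows why the database sees far fewer connections than the application opens.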

Read replicas help when your read workload significantly exceeds your write workload, which is true for most SaaS applications. Route read queries to replicas and write queries to the primary. The tradeoff is replication lag — reads from a replica may not reflect the most recent writes. For most application queries this is acceptable, but for operations where the user just performed a write and expects to see the result immediately, you need to route to the primary.
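One common way to handle the read-your-writes case is session stickiness: after a session writes, route its reads to the primary for a short window that covers typical replication lag. A hedged sketch, with illustrative names and a made-up stickiness window:

```python
import time

# Hypothetical router: writes go to the primary; reads go to a replica
# unless the same session wrote recently (read-your-writes consistency).
class SessionRouter:
    def __init__(self, primary, replica, stickiness_seconds=5.0):
        self.primary, self.replica = primary, replica
        self.stickiness = stickiness_seconds
        self._last_write = {}  # session_id -> timestamp of last write

    def for_write(self, session_id):
        self._last_write[session_id] = time.monotonic()
        return self.primary

    def for_read(self, session_id):
        wrote_at = self._last_write.get(session_id)
        if wrote_at and time.monotonic() - wrote_at < self.stickiness:
            return self.primary  # replica may still lag behind this write
        return self.replica

router = SessionRouter(primary="primary-db", replica="replica-db")
print(router.for_read("alice"))   # replica-db
print(router.for_write("alice"))  # primary-db
print(router.for_read("alice"))   # primary-db (she just wrote)
print(router.for_read("bob"))     # replica-db
```

The stickiness window should exceed your observed replication lag; in a multi-process deployment the last-write timestamps would live in a shared store or a session cookie, not a local dict.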


Caching Strategy

Once database queries are optimized, caching is the next layer of performance improvement. But caching poorly is worse than not caching at all — stale data bugs are notoriously difficult to diagnose.

Cache at the right layer. Application-level caching (Redis or Memcached) for computed values and frequently-accessed reference data. HTTP-level caching (CDN, browser cache headers) for static assets and API responses that change infrequently. Database query caching is usually the wrong place to cache — cache the result at the application layer where you have more control over invalidation.

Invalidation strategy must be explicit. Every cached value needs a defined invalidation trigger. "Cache for 5 minutes" is fine for dashboards that don't need real-time accuracy. "Invalidate when the underlying data changes" is necessary for data that users expect to update immediately after a write. Event-driven invalidation — clearing the cache when a relevant domain event fires — keeps your caching logic decoupled from your write paths.
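Both strategies can coexist in one cache. The sketch below (class and key names are illustrative) supports TTL entries for dashboard-style data and tag-based invalidation that a domain-event handler can trigger:

```python
import time

# Illustrative cache supporting both strategies from the text:
# TTL expiry, and tag-based event-driven invalidation.
class Cache:
    def __init__(self):
        self._data = {}   # key -> (value, expires_at or None)
        self._tags = {}   # tag -> set of keys to clear together

    def set(self, key, value, ttl=None, tags=()):
        expires = time.monotonic() + ttl if ttl else None
        self._data[key] = (value, expires)
        for tag in tags:
            self._tags.setdefault(tag, set()).add(key)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.monotonic() > expires:
            del self._data[key]   # lazy TTL expiry on read
            return None
        return value

    def invalidate_tag(self, tag):
        # Called from a domain-event handler, e.g. on a "user.updated" event.
        for key in self._tags.pop(tag, ()):
            self._data.pop(key, None)

cache = Cache()
cache.set("dashboard:stats", {"users": 120}, ttl=300)             # "cache for 5 minutes"
cache.set("user:42:profile", {"name": "Ada"}, tags=("user:42",))  # event-invalidated
cache.invalidate_tag("user:42")
print(cache.get("user:42:profile"), cache.get("dashboard:stats"))
```

With Redis the same pattern uses `EXPIRE` for TTLs and a set per tag whose members are deleted when the event fires.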

For multi-tenant applications, caching must be tenant-aware. A cache key that doesn't include the tenant identifier risks leaking data between tenants, which is a security incident, not just a bug.
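A cheap structural defense is to build every cache key through one helper that refuses to produce an unscoped key (the naming scheme here is illustrative):

```python
# Tenant-scoped cache keys: the tenant id is always part of the key,
# so tenant A's cached value can never be served to tenant B.
def cache_key(tenant_id, *parts):
    if not tenant_id:
        raise ValueError("cache keys must be tenant-scoped")
    return ":".join([f"tenant:{tenant_id}", *map(str, parts)])

print(cache_key("acme", "report", 2025))    # tenant:acme:report:2025
print(cache_key("globex", "report", 2025))  # tenant:globex:report:2025 -- distinct key
```

Funneling all key construction through one function also gives you a single place to audit when reviewing for cross-tenant leaks.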


Background Jobs and Queue Architecture

At some point, your web request handlers need to stop doing work synchronously and start delegating to background job processors. This transition typically happens around 1,000-2,000 active users, when certain operations take too long to complete within a web request timeout.

Email sending should be the first thing moved to a background queue. It adds latency and failure points to web requests without any benefit to the user — they don't need the email to arrive before the page loads.

Report generation, data exports, and bulk operations follow naturally. Anything that might take more than a few seconds should run in the background with a mechanism to notify the user when it's complete.

Queue architecture matters more than queue technology. Use a reliable queue (Redis with persistence, or a dedicated message broker) and ensure your job processors are idempotent. Jobs will be retried. Servers will crash mid-processing. Your job code must handle being executed multiple times for the same job without producing incorrect results.
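Idempotency usually comes down to recording a job ID alongside the work it performs. A minimal sketch (in production the processed-jobs record lives in the database or Redis, written in the same transaction as the side effect, not in process memory):

```python
# Idempotent job handler sketch: a processed-jobs set makes retries and
# duplicate deliveries safe for a non-idempotent operation like a credit.
processed = set()
balances = {"acct-1": 100}

def credit_account(job_id, account, amount):
    if job_id in processed:       # duplicate delivery or retry: do nothing
        return balances[account]
    balances[account] += amount
    processed.add(job_id)         # record completion together with the write
    return balances[account]

credit_account("job-7", "acct-1", 50)
credit_account("job-7", "acct-1", 50)  # retried after a worker crash
print(balances["acct-1"])  # 150, not 200
```

The key design choice is that the completion record and the side effect commit atomically; otherwise a crash between the two reintroduces the double-execution window.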

Worker scaling is independent from application scaling. You can add more workers without adding more web servers, and you should monitor queue depth to detect when workers can't keep up with the incoming job rate. A growing queue depth is an early warning of a scaling bottleneck.
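A queue-depth alert can be as simple as the sketch below: flag when the depth crosses an absolute threshold or when recent samples are monotonically growing (threshold and sample count are illustrative, and would be tuned to your job rate):

```python
# Queue-depth alerting sketch: alert on absolute depth, or on sustained
# growth, which signals workers falling behind before the queue is huge.
def check_queue_depth(depth_samples, threshold=1000):
    latest = depth_samples[-1]
    growing = (len(depth_samples) >= 3
               and depth_samples[-3] < depth_samples[-2] < latest)
    return latest > threshold or growing

print(check_queue_depth([10, 12, 9]))      # False: steady, workers keeping up
print(check_queue_depth([200, 600, 900]))  # True: growing, early warning
print(check_queue_depth([1500]))           # True: over absolute threshold
```

In practice the samples would come from your queue's own metrics (e.g. Redis `LLEN` on the job list) scraped by your monitoring system.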


Monitoring Before You Need It

The prerequisite for everything I've described is monitoring that tells you what's actually happening. Application performance monitoring (APM) that tracks request latency, database query time, and error rates. Infrastructure monitoring that tracks CPU, memory, disk I/O, and network use. Custom metrics for business-relevant indicators — queue depth, active users, API response times by endpoint.

Set up alerting before you need it. The time to install monitoring is not during an outage — it's before the first outage, when you have time to instrument thoughtfully. Having the data to diagnose a bottleneck before it becomes a customer-facing issue is the difference between proactive scaling and reactive firefighting.

Building for scale is less about technology choices and more about the discipline to measure, identify bottlenecks, and address them systematically. The infrastructure that serves 10,000 users is rarely dramatically different from what serves 1,000 — it's the same stack with better queries, smarter caching, and background processing for the heavy work.

