Engineering · 7 min read · March 3, 2026

API Performance Optimization: Making Your Endpoints Fast at Scale

Slow APIs kill user experience and increase infrastructure costs. Here's the systematic approach to profiling, optimizing, and scaling API performance in production.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

API Latency Is a Product Problem

Users experience your API through your UI. Every millisecond of API latency is a millisecond added to page loads, form submissions, and data refreshes. At scale, a 200ms median latency with a 2-second p99 means the slowest 1% of requests take ten times longer than the typical one; at 10,000 API calls per minute, that is 100 slow requests every minute.

This guide covers the systematic approach to measuring API performance, identifying the root causes of latency, and applying the optimizations that actually move the numbers.


Measuring Before Optimizing

You can't optimize what you don't measure, and you can't know if your optimization worked without before/after data.

The metrics that matter:

  • Median (p50) latency: The typical user experience. This is the number most monitoring dashboards show, and it's the least useful by itself.
  • p95 latency: 95% of requests are faster than this. The experience of most users who are having a "bad" day.
  • p99 latency: 99% of requests are faster than this. The tail latency. This is where real pain lives.
  • Error rate: The percentage of requests returning 4xx or 5xx responses.
  • Throughput: Requests per second. Rising throughput with rising latency indicates a scaling problem.

Median tells you the typical case. P99 tells you the worst case that users regularly encounter. Both matter; optimize p99 without letting median degrade.
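To make the percentile definitions concrete, here is a minimal sketch of how p50/p95/p99 fall out of raw timing data, using the nearest-rank method (the helper is illustrative, not part of any monitoring SDK):

```typescript
// Nearest-rank percentile over raw request durations (in ms).
function percentile(durations: number[], p: number): number {
  const sorted = [...durations].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(0, rank - 1)]
}

// 98 ordinary requests (100-197ms) plus two slow outliers.
const samples = Array.from({ length: 98 }, (_, i) => 100 + i)
samples.push(2000, 2500)

const p50 = percentile(samples, 50)  // 149 — looks healthy
const p95 = percentile(samples, 95)  // 194 — still fine
const p99 = percentile(samples, 99)  // 2000 — the tail reveals the outliers
```

Notice that the median and p95 look perfectly healthy here; only the p99 exposes the outliers, which is exactly why tail latency deserves its own dashboard panel.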

Instrumentation at the request level:

// Hono-style middleware; `metrics` stands in for whatever stats
// client you use (StatsD, Datadog, OpenTelemetry, etc.).
app.use(async (c, next) => {
  const start = Date.now()
  await next()
  const duration = Date.now() - start

  // Send to your observability platform
  metrics.histogram('api.request.duration', duration, {
    method: c.req.method,
    path: c.req.path,
    status: c.res.status.toString(),
  })
})

With per-route timing data in your observability platform, you can identify exactly which endpoints have high latency and tail latency problems.


The Database Layer: Usually Where Latency Lives

As covered in the database performance article, the most common cause of slow API endpoints is slow database queries. The first step in diagnosing a slow endpoint is measuring query time versus total request time.

Isolate the database time:

async function getProjectDetails(projectId: string) {
  const t0 = Date.now()
  const project = await db.project.findUnique({
    where: { id: projectId },
    include: { members: true, tasks: true, milestones: true }
  })
  metrics.histogram('db.query.project_details', Date.now() - t0)
  return project
}

If the database query takes 180ms out of a 200ms request, fixing the query solves 90% of the problem. If the query takes 10ms and the request takes 200ms, the problem is elsewhere.

Query optimization quick wins:

  • Add indexes on columns used in WHERE, JOIN, and ORDER BY clauses
  • Replace N+1 patterns with eager loading or batch queries
  • Select only the columns you need instead of SELECT *
  • Cache results for frequently-read, infrequently-changing data
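The N+1 fix in the list above can be sketched with an in-memory stand-in for the database — `tasksByProject` below plays the role of a single `WHERE projectId IN (...)` batch query, grouped in application code (the table and helper are hypothetical, not a real ORM API):

```typescript
type Task = { id: number; projectId: number }

// Stand-in for a database table of tasks.
const taskTable: Task[] = [
  { id: 1, projectId: 1 },
  { id: 2, projectId: 1 },
  { id: 3, projectId: 2 },
]

// N+1 pattern: one query per project (N round trips).
// Batch pattern: one query for all projects, grouped in memory.
function tasksByProject(projectIds: number[]): Map<number, Task[]> {
  // Single "query": WHERE projectId IN (...)
  const rows = taskTable.filter((t) => projectIds.includes(t.projectId))
  const grouped = new Map<number, Task[]>()
  for (const id of projectIds) grouped.set(id, [])
  for (const row of rows) grouped.get(row.projectId)!.push(row)
  return grouped
}

const grouped = tasksByProject([1, 2])
// grouped.get(1) holds 2 tasks, grouped.get(2) holds 1 — one round trip total
```

The win is in round trips: N projects cost one query instead of N, and the grouping work moves to cheap in-process memory.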

Caching at the API Layer

Application-level caching (Redis) reduces database load and request latency for read-heavy operations. The patterns that work:

Response caching for public data:

async function getPublicProjectStats(projectId: string) {
  const cacheKey = `stats:${projectId}`
  const cached = await redis.get(cacheKey)
  if (cached) return JSON.parse(cached)

  const stats = await computeProjectStats(projectId)
  await redis.set(cacheKey, JSON.stringify(stats), 'EX', 300)
  return stats
}

Cache warming for predictable access patterns: For dashboards that aggregate data (weekly summaries, report data), pre-compute and cache the results on a schedule rather than computing on-demand. The report runs at midnight; users get the cached result instantly.

Selective caching based on cache hit rates: Not all data is worth caching. Cache data that is expensive to compute (complex aggregations, multiple-table joins), accessed frequently (dashboard data, user preferences), and has low invalidation frequency. Skip caching for data that changes per-request or has complex invalidation logic.
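One way to ground the "worth caching" decision in data is to instrument the cache itself and watch hit rates per key family. A minimal sketch, assuming a simple Map-backed cache rather than Redis:

```typescript
// Cache wrapper that counts hits and misses, so you can measure
// whether a key family is actually worth caching.
class InstrumentedCache {
  private store = new Map<string, unknown>()
  hits = 0
  misses = 0

  get(key: string): unknown | undefined {
    if (this.store.has(key)) {
      this.hits++
      return this.store.get(key)
    }
    this.misses++
    return undefined
  }

  set(key: string, value: unknown): void {
    this.store.set(key, value)
  }

  hitRate(): number {
    const total = this.hits + this.misses
    return total === 0 ? 0 : this.hits / total
  }
}

const cache = new InstrumentedCache()
cache.get('stats:1')             // miss — compute and store
cache.set('stats:1', { n: 42 })
cache.get('stats:1')             // hit
cache.get('stats:1')             // hit
// hitRate() → 2 hits / 3 lookups ≈ 0.67
```

A key family with a hit rate near zero is pure overhead: you pay the cache round trip and the invalidation complexity without saving any compute.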


Payload Size and Serialization

Large response payloads increase network transfer time and deserialization time on the client. Auditing what your API returns is often surprisingly productive.

Return only the fields the client needs. If your user endpoint returns 30 fields but the client only uses 8, you're transferring and serializing 22 unnecessary fields on every call. GraphQL solves this structurally; with REST, use field selection parameters or create endpoint variants for different use cases.
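With REST, trimming the payload can be as simple as whitelisting fields before serialization. A hedged sketch — the `pick` helper below is illustrative, not from any framework:

```typescript
// Return a copy of obj containing only the whitelisted fields.
function pick<T extends object, K extends keyof T>(obj: T, keys: K[]): Pick<T, K> {
  const out = {} as Pick<T, K>
  for (const key of keys) out[key] = obj[key]
  return out
}

const fullUser = {
  id: 'u1',
  name: 'Ada',
  email: 'ada@example.com',
  createdAt: '2026-01-01',
  internalFlags: 7,
  passwordHash: 'x',
}

// The client only needs three of these fields — and sensitive
// fields like passwordHash never leave the server at all.
const response = pick(fullUser, ['id', 'name', 'email'])
```

Beyond the bandwidth savings, an explicit whitelist doubles as a safety net: new columns added to the model don't silently leak into API responses.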

Paginate large collections. Returning 1,000 items in a single response is almost never correct. Add pagination to any endpoint that can return more than 100 items. Cursor-based pagination (returning a cursor for the next page rather than an offset) is more efficient for large datasets.
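Cursor-based pagination can be sketched against an in-memory array sorted by id — the shape of `getPage` below is illustrative; in SQL the filter becomes `WHERE id > $cursor ORDER BY id LIMIT $limit`, which an index can satisfy without scanning skipped rows:

```typescript
type Item = { id: number; name: string }

// Cursor-based page: "give me `limit` items with id greater than `cursor`".
// Unlike OFFSET, the database can seek straight to the cursor via an index.
function getPage(items: Item[], limit: number, cursor?: number) {
  const sorted = [...items].sort((a, b) => a.id - b.id)
  const eligible = cursor === undefined ? sorted : sorted.filter((i) => i.id > cursor)
  const page = eligible.slice(0, limit)
  // Only hand back a cursor when a full page suggests more data remains.
  const nextCursor = page.length === limit ? page[page.length - 1].id : undefined
  return { page, nextCursor }
}

const items = Array.from({ length: 5 }, (_, i) => ({ id: i + 1, name: `item${i + 1}` }))
const first = getPage(items, 2)                      // ids 1, 2; nextCursor 2
const second = getPage(items, 2, first.nextCursor)   // ids 3, 4; nextCursor 4
```

Cursors also stay correct when rows are inserted or deleted between page fetches, which offset-based pagination handles badly (rows shift, so items get skipped or repeated).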

JSON serialization performance. The standard JSON.stringify is slower than specialized serializers for high-volume endpoints. Libraries like fast-json-stringify (which pre-compiles serializers from a schema) are 2-5x faster for complex objects.

Compression. Enable gzip or Brotli compression for responses. This is almost always a net win for text-based API responses over JSON — typical compression ratios are 70-80% for large JSON payloads. The CPU cost is low relative to the network transfer savings, especially for mobile clients on variable connections.
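The compression-ratio claim is easy to verify locally with Node's built-in zlib — a sketch only; in production you would normally let your framework middleware or reverse proxy negotiate Content-Encoding rather than compressing by hand:

```typescript
import { gzipSync, gunzipSync } from 'node:zlib'

// A repetitive JSON payload, like a typical list-endpoint response.
const payload = JSON.stringify(
  Array.from({ length: 500 }, (_, i) => ({ id: i, status: 'active', role: 'member' }))
)

const compressed = gzipSync(Buffer.from(payload))
// Fraction of bytes saved; repetitive JSON typically compresses by well over 70%.
const ratio = 1 - compressed.length / Buffer.byteLength(payload)

// Decompression round-trips losslessly.
const restored = gunzipSync(compressed).toString()
```

The repeated keys and enum-like values in JSON lists are exactly what DEFLATE's dictionary coding exploits, which is why list endpoints see the best ratios.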


Connection Management

Database connection pooling. Each database connection has overhead — memory on the database server, TCP connection setup cost, and in PostgreSQL, a dedicated process. Without connection pooling, every request creates and destroys a connection. With pooling, connections are reused.

For PostgreSQL, use PgBouncer (external) or the connection pool built into Prisma. Configure the pool size based on your database server's max_connections setting and your application's concurrency — a common starting point is pool size = (number of CPU cores × 2) + spindle count.

HTTP keep-alive for external API calls. If your API makes HTTP requests to external services, use an HTTP client that maintains keep-alive connections rather than creating a new connection for each request. In Node.js, use an https.Agent with keepAlive: true:

const agent = new https.Agent({ keepAlive: true, maxSockets: 100 })
// Pass the agent to node-fetch or https.request; Node's built-in
// fetch (undici) ignores this option but reuses connections by default.
const response = await fetch(url, { agent })

This eliminates TCP handshake overhead for repeated calls to the same host.


Concurrency: Don't Wait When You Don't Have To

APIs that make multiple independent requests sequentially waste time. If two operations don't depend on each other, run them in parallel.

// Sequential: 300ms if each takes 150ms
const user = await getUser(userId)
const stats = await getUserStats(userId)

// Parallel: 150ms
const [user, stats] = await Promise.all([
  getUser(userId),
  getUserStats(userId),
])

This pattern is particularly impactful for endpoints that aggregate data from multiple sources — user data, their recent activity, their team members, their account status — where each query is independent.


Rate Limiting and Timeout Management

Timeouts on everything. Every external call your API makes — database queries, HTTP requests to third-party services, cache operations — should have a timeout. Without timeouts, a slow external service can hold your request threads indefinitely, causing cascading slowdowns.

function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: NodeJS.Timeout
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)
  })
  // Clear the timer so a fast result doesn't leave it holding the event loop open
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

const result = await withTimeout(fetchExternalData(id), 5000)

Circuit breakers for external dependencies. If a downstream service is consistently failing or slow, a circuit breaker prevents your API from waiting for it on every request. After a threshold of failures, the circuit "opens" and requests fail fast with a cached or degraded response until the downstream service recovers.
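The circuit breaker logic can be sketched as a small state machine without any library — the thresholds and naming below are illustrative, and a production implementation would also limit probes in the half-open state:

```typescript
// Minimal circuit breaker: opens after `threshold` consecutive failures,
// fails fast while open, and allows a probe once `cooldownMs` has elapsed.
class CircuitBreaker {
  private failures = 0
  private openedAt = 0

  constructor(
    private threshold: number,
    private cooldownMs: number,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  allowRequest(): boolean {
    if (this.failures < this.threshold) return true
    // Open: only allow a probe once the cooldown has elapsed.
    return this.now() - this.openedAt >= this.cooldownMs
  }

  recordSuccess(): void {
    this.failures = 0 // close the circuit
  }

  recordFailure(): void {
    this.failures++
    if (this.failures === this.threshold) this.openedAt = this.now()
  }
}

let clock = 0
const breaker = new CircuitBreaker(3, 10_000, () => clock)
breaker.recordFailure()
breaker.recordFailure()
breaker.recordFailure() // circuit opens; allowRequest() is now false
clock = 10_000          // after the cooldown, a probe is allowed,
                        // and a recorded success closes the circuit again
```

Wrapping every external call with `allowRequest()` turns a slow dependency into an immediate, handleable error instead of a pile-up of blocked requests.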


Load Testing to Validate Improvements

Profiling individual queries and caching strategies improves latency in isolation. Load testing validates that your optimizations hold under real concurrency — multiple users hitting the same endpoints simultaneously.

Tools: k6 (JavaScript test scripts, good for CI integration), Artillery (YAML-based, easy to configure), Apache JMeter (UI-based, good for complex scenarios).

A basic load test protocol: establish baseline metrics at 10, 50, 100, and 500 concurrent users. Identify the concurrency level where latency starts to degrade significantly. Optimize, re-test, repeat.


API performance optimization is a discipline, not a one-time task. Build the instrumentation, run it continuously, and treat regressions the same way you treat bugs. If you're working on an API with latency problems and want help diagnosing and prioritizing the work, book a call at calendly.com/jamesrossjr.

