Building a Reliable Webhook System: Delivery Guarantees and Failure Handling
A complete guide to building production-grade webhooks — HMAC signatures, retry logic, idempotency, fanout architecture, and the operational concerns that most guides skip.

James Ross Jr.
Strategic Systems Architect & Enterprise Software Developer
Webhooks sound simple — send an HTTP POST when something happens. The simplicity is deceptive. A production webhook system needs delivery guarantees, security, retry logic, failure visibility, and a way to handle the thousands of edge cases that emerge when you are delivering millions of events to hundreds of different endpoints.
This guide covers building a webhook system that behaves correctly under failure conditions and gives customers the reliability they need to build against.
The Core Architecture
A naive webhook system: an event happens, you send a POST, you move on. The problem is what happens when the POST fails — the customer's endpoint is down, returns a 500, or times out. The event is lost.
A reliable webhook system separates event publishing from delivery:
Event occurs
→ Write to webhook_deliveries table (durable, one row per subscribed endpoint)
→ Enqueue delivery job
→ Job delivers to each endpoint
→ Retry on failure
→ Mark delivered or permanently failed
This design ensures that even if every delivery attempt fails, the event is recorded and can be replayed.
Database Schema
CREATE TABLE webhook_endpoints (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL REFERENCES users(id),
  url TEXT NOT NULL,
  secret TEXT NOT NULL, -- Stored encrypted
  events TEXT[] NOT NULL DEFAULT '{}', -- Which events to subscribe to
  active BOOLEAN NOT NULL DEFAULT true,
  -- TIMESTAMPTZ stores an absolute instant, independent of the server time zone
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE webhook_deliveries (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  endpoint_id UUID NOT NULL REFERENCES webhook_endpoints(id),
  event_type TEXT NOT NULL,
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending', -- pending, delivered, failed
  attempts INTEGER NOT NULL DEFAULT 0,
  next_retry_at TIMESTAMPTZ,
  last_error TEXT,
  delivered_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Partial index: delivered rows are never polled, so they are left out
CREATE INDEX idx_webhook_deliveries_status ON webhook_deliveries(status, next_retry_at)
  WHERE status IN ('pending', 'failed');
HMAC Signatures
Endpoints cannot trust that an incoming webhook is really from you without cryptographic verification. Sign every payload with HMAC-SHA256:
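import crypto from 'crypto'

// Returns a header value of the form "t=<unix seconds>,v1=<hex HMAC>".
// The timestamp is part of the signed string so receivers can reject replays.
export function signPayload(
  payload: string,
  secret: string,
  timestamp: string = Math.floor(Date.now() / 1000).toString()
): string {
  const signedPayload = `${timestamp}.${payload}`
  const signature = crypto
    .createHmac('sha256', secret)
    .update(signedPayload)
    .digest('hex')
  return `t=${timestamp},v1=${signature}`
}

// Include in headers. Serialize once and send exactly the string that was signed;
// the same timestamp goes into the signature and the Webhook-Timestamp header.
const body = JSON.stringify(payload)
const timestamp = Math.floor(Date.now() / 1000).toString()
const headers = {
  'Content-Type': 'application/json',
  'Webhook-Signature': signPayload(body, endpoint.secret, timestamp),
  'Webhook-ID': deliveryId,
  'Webhook-Timestamp': timestamp,
}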
Verification code your customers implement:
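import crypto from 'crypto'

function verifyWebhook(
  payload: string, // raw request body, exactly as received
  signature: string, // value of the Webhook-Signature header
  secret: string,
  toleranceSeconds = 300
): boolean {
  const parts = Object.fromEntries(
    signature.split(',').map(p => p.split('='))
  )
  const timestamp = parseInt(parts.t, 10)
  const receivedSig = parts.v1
  // Reject malformed headers before doing any comparison
  if (!Number.isFinite(timestamp) || typeof receivedSig !== 'string') {
    return false
  }
  // Reject old webhooks (replay attack prevention)
  if (Math.abs(Date.now() / 1000 - timestamp) > toleranceSeconds) {
    return false
  }
  const expectedSig = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${payload}`)
    .digest('hex')
  // Constant-time comparison prevents timing attacks.
  // timingSafeEqual throws when the lengths differ, so check that first.
  const received = Buffer.from(receivedSig)
  const expected = Buffer.from(expectedSig)
  return received.length === expected.length &&
    crypto.timingSafeEqual(received, expected)
}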
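One receiver-side pitfall is worth illustrating: the signature covers the exact string that was sent, so verification has to run against the raw request body, not against JSON that was parsed and serialized again. The sketch below is self-contained and uses compact stand-ins for the functions above; `sign`, `verify`, the secret `whsec_demo`, and the fixed clock value are assumptions made for the example, not part of the original code.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto'

// Illustrative stand-ins for the signing and verification functions above.
function sign(body: string, secret: string, timestamp: number): string {
  const mac = createHmac('sha256', secret).update(`${timestamp}.${body}`).digest('hex')
  return `t=${timestamp},v1=${mac}`
}

function verify(
  body: string,
  header: string,
  secret: string,
  nowSeconds: number,
  toleranceSeconds = 300
): boolean {
  // Accept only the exact "t=<digits>,v1=<64 hex chars>" shape
  const match = /^t=(\d+),v1=([0-9a-f]{64})$/.exec(header)
  if (!match) return false
  const timestamp = Number(match[1])
  if (Math.abs(nowSeconds - timestamp) > toleranceSeconds) return false
  const expected = createHmac('sha256', secret).update(`${timestamp}.${body}`).digest()
  const received = Buffer.from(match[2], 'hex')
  return received.length === expected.length && timingSafeEqual(received, expected)
}

const secret = 'whsec_demo' // hypothetical secret
const now = 1700000000 // fixed clock so the example is deterministic
const rawBody = '{"id":"evt_1","type":"payment.succeeded"}'
const header = sign(rawBody, secret, now)

// Same data, different bytes: parsing and pretty-printing changes the string.
const reserialized = JSON.stringify(JSON.parse(rawBody), null, 2)

const results = {
  rawBodyAccepted: verify(rawBody, header, secret, now),
  reserializedAccepted: verify(reserialized, header, secret, now),
  staleAccepted: verify(rawBody, header, secret, now + 301),
}
```

In practice this means reading the body as text before parsing it (in Hono, for example, `await c.req.text()`), verifying, and only then calling `JSON.parse`.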
Retry Logic With Exponential Backoff
Retry failed deliveries on a backoff schedule whose delays grow with each attempt, so a struggling endpoint is not hammered while it recovers:
const RETRY_DELAYS = [
  5, // 5 seconds
  30, // 30 seconds
  300, // 5 minutes
  1800, // 30 minutes
  7200, // 2 hours
  86400, // 24 hours
]

async function deliverWebhook(deliveryId: string): Promise<void> {
  const delivery = await db.query.webhookDeliveries.findFirst({
    where: eq(webhookDeliveries.id, deliveryId),
    with: { endpoint: true },
  })
  // Skip missing rows and rows that already reached a final state
  if (!delivery || delivery.status !== 'pending') return
  const attempts = delivery.attempts + 1
  const payload = JSON.stringify({
    id: delivery.id,
    type: delivery.eventType,
    data: delivery.payload,
    created: delivery.createdAt.toISOString(),
  })
  try {
    const response = await fetch(delivery.endpoint.url, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // If the secret is stored encrypted, decrypt it before signing
        'Webhook-Signature': signPayload(payload, delivery.endpoint.secret),
        'Webhook-ID': delivery.id,
      },
      body: payload,
      signal: AbortSignal.timeout(30000), // 30 second timeout
    })
    if (response.ok) {
      await db.update(webhookDeliveries)
        .set({ status: 'delivered', attempts, deliveredAt: new Date() })
        .where(eq(webhookDeliveries.id, deliveryId))
      return
    }
    // Keep only the start of the response body so last_error stays small
    const responseBody = (await response.text()).slice(0, 500)
    throw new Error(`HTTP ${response.status}: ${responseBody}`)
  } catch (error) {
    const lastError = error instanceof Error ? error.message : 'Unknown error'
    // One initial attempt plus one retry per entry in RETRY_DELAYS.
    // Using RETRY_DELAYS.length alone would never reach the 24-hour delay.
    const maxAttempts = RETRY_DELAYS.length + 1
    if (attempts >= maxAttempts) {
      await db.update(webhookDeliveries)
        .set({ status: 'failed', attempts, lastError })
        .where(eq(webhookDeliveries.id, deliveryId))
      // Disable endpoint after repeated failures
      await checkAndDisableEndpoint(delivery.endpointId)
      return
    }
    const delaySeconds = RETRY_DELAYS[attempts - 1]
    const nextRetryAt = new Date(Date.now() + delaySeconds * 1000)
    await db.update(webhookDeliveries)
      .set({ status: 'pending', attempts, nextRetryAt, lastError })
      .where(eq(webhookDeliveries.id, deliveryId))
  }
}
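`checkAndDisableEndpoint` is called above but not shown. One plausible policy, sketched here as a pure function so the rule is easy to test, is to disable an endpoint once its most recent finished deliveries have all failed permanently. The name `shouldDisableEndpoint`, the `DeliveryOutcome` type, and the threshold of 5 are assumptions for illustration; a real helper would load recent statuses from `webhook_deliveries` and set `active = false` on the endpoint when this returns true.

```typescript
// Hypothetical policy helper; not part of the original code.
type DeliveryOutcome = 'delivered' | 'failed'

/**
 * Returns true when the newest `threshold` finished deliveries all failed.
 * `recent` is ordered newest first and excludes deliveries still pending.
 */
function shouldDisableEndpoint(recent: DeliveryOutcome[], threshold = 5): boolean {
  if (recent.length < threshold) return false // not enough history to judge
  return recent.slice(0, threshold).every(outcome => outcome === 'failed')
}

// A success among the newest five keeps the endpoint active.
const healthy: DeliveryOutcome[] = ['failed', 'failed', 'delivered', 'failed', 'failed', 'failed']
// Five consecutive permanent failures trip the rule.
const broken: DeliveryOutcome[] = ['failed', 'failed', 'failed', 'failed', 'failed', 'delivered']

const disableHealthy = shouldDisableEndpoint(healthy)
const disableBroken = shouldDisableEndpoint(broken)
```

Whatever rule is chosen, it helps to notify the customer when an endpoint is disabled and to let them re-enable it themselves.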
The Delivery Worker
A worker process polls for pending deliveries:
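async function runDeliveryWorker() {
  while (true) {
    // Oldest due deliveries first. This assumes a single worker; with several
    // workers, claim rows first (for example SELECT ... FOR UPDATE SKIP LOCKED)
    // so the same delivery is not sent twice.
    const pending = await db.select()
      .from(webhookDeliveries)
      .where(and(
        eq(webhookDeliveries.status, 'pending'),
        lte(webhookDeliveries.nextRetryAt, new Date()),
      ))
      .orderBy(asc(webhookDeliveries.nextRetryAt))
      .limit(50)
    if (pending.length === 0) {
      // Nothing due: sleep before polling again
      await new Promise(resolve => setTimeout(resolve, 5000))
      continue
    }
    // Process deliveries concurrently; allSettled keeps one failure from
    // rejecting the whole batch
    await Promise.allSettled(
      pending.map(delivery => deliverWebhook(delivery.id))
    )
  }
}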
In production, use a proper job queue (BullMQ, Inngest, or similar) rather than polling. The database polling approach works for modest volumes but does not scale to high delivery rates.
Idempotency
Webhooks may be delivered more than once: for example, the customer's endpoint processed the event but its response was lost or timed out, so the system retried. Customers must handle duplicate deliveries.
Every webhook should have a unique ID that customers can use to deduplicate:
{
  "id": "evt_01j9abc...",
  "type": "payment.succeeded",
  "data": { ... },
  "created": "2026-03-03T12:00:00Z"
}
Customers store processed event IDs:
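// Customer-side deduplication
async function handleWebhook(event: WebhookEvent) {
  // SET ... NX returns 'OK' when the key was created and null when it already exists
  const claimed = await redis.set(
    `webhook:${event.id}`,
    '1',
    'EX', 172800, // 48 hours: longer than the full retry schedule (about 27 hours)
    'NX' // Only set if not exists
  )
  if (!claimed) {
    return // Already processed
  }
  try {
    // Process the event
  } catch (error) {
    // Release the claim so the sender's retry is not skipped
    await redis.del(`webhook:${event.id}`)
    throw error
  }
}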
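The claim-then-process pattern can be exercised without Redis. This sketch swaps the Redis key for an in-memory map with a TTL and a synchronous handler; `createDeduper`, its injected clock, and the event IDs are hypothetical names for illustration. It shows three properties worth testing in any receiver: a duplicate delivery is skipped, a failed handler releases its claim so the sender's retry is processed, and an ID is accepted again once its TTL has passed.

```typescript
// In-memory stand-in for the Redis SET ... NX EX pattern; illustrative only.
type Handler = (eventId: string) => void

function createDeduper(ttlMs: number, now: () => number = Date.now) {
  const seen = new Map<string, number>() // event id -> expiry time (ms)

  return function processOnce(eventId: string, handler: Handler): 'processed' | 'duplicate' {
    const expiresAt = seen.get(eventId)
    if (expiresAt !== undefined && expiresAt > now()) return 'duplicate'

    seen.set(eventId, now() + ttlMs) // claim before processing
    try {
      handler(eventId)
      return 'processed'
    } catch (error) {
      seen.delete(eventId) // release the claim so a retry is not dropped
      throw error
    }
  }
}

let clock = 0 // fake clock in milliseconds
const processOnce = createDeduper(1000, () => clock)
const handled: string[] = []

const first = processOnce('evt_1', id => { handled.push(id) })
const second = processOnce('evt_1', id => { handled.push(id) }) // duplicate delivery

let failedOnce = false
try {
  processOnce('evt_2', () => { throw new Error('downstream unavailable') })
} catch {
  failedOnce = true
}
const retried = processOnce('evt_2', id => { handled.push(id) }) // sender retries

clock = 1001 // the claim on evt_1 has expired
const afterTtl = processOnce('evt_1', id => { handled.push(id) })
```

The last case is why the TTL should outlast the sender's retry window: once a key expires, a late retry is treated as a new event.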
Fanout to Multiple Endpoints
When a single event needs to be delivered to multiple endpoints (different customers subscribed to the same event type), create a delivery record per endpoint:
async function publishEvent(eventType: string, payload: unknown) {
  // Find all active endpoints subscribed to this event type
  const endpoints = await db.select()
    .from(webhookEndpoints)
    .where(and(
      eq(webhookEndpoints.active, true),
      sql`${webhookEndpoints.events} @> ARRAY[${eventType}]`
    ))
  // Create delivery records for each endpoint
  if (endpoints.length > 0) {
    await db.insert(webhookDeliveries)
      .values(endpoints.map(endpoint => ({
        endpointId: endpoint.id,
        eventType,
        payload: payload as Record<string, unknown>,
        nextRetryAt: new Date(),
      })))
  }
}
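To make the subscription filter concrete: `@>` is PostgreSQL's array containment operator, so the query keeps endpoints whose `events` array includes the published type. The sketch below mirrors that selection in plain TypeScript over an in-memory list and produces one delivery row per matching endpoint. The `Endpoint` and `NewDelivery` shapes, `buildDeliveries`, and the sample data are assumptions for illustration.

```typescript
// Illustrative in-memory version of the fanout selection; not the original code.
interface Endpoint {
  id: string
  events: string[]
  active: boolean
}

interface NewDelivery {
  endpointId: string
  eventType: string
  payload: unknown
  nextRetryAt: Date
}

function buildDeliveries(
  endpoints: Endpoint[],
  eventType: string,
  payload: unknown,
  now: Date
): NewDelivery[] {
  return endpoints
    // Same rule as the SQL: active AND events contains eventType
    .filter(endpoint => endpoint.active && endpoint.events.includes(eventType))
    .map(endpoint => ({ endpointId: endpoint.id, eventType, payload, nextRetryAt: now }))
}

const sampleEndpoints: Endpoint[] = [
  { id: 'ep_a', events: ['payment.succeeded', 'payment.failed'], active: true },
  { id: 'ep_b', events: ['invoice.created'], active: true },
  { id: 'ep_c', events: ['payment.succeeded'], active: false }, // disabled, so skipped
]

const rows = buildDeliveries(sampleEndpoints, 'payment.succeeded', { amount: 1200 }, new Date(0))
const targets = rows.map(row => row.endpointId)
```

Because each endpoint gets its own row, one customer's failing endpoint retries on its own schedule without delaying delivery to anyone else.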
Operational Visibility
Your customers need to see delivery attempts, successes, and failures. Build a delivery log UI on top of API endpoints like these:
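// GET /api/webhooks/deliveries
app.get('/api/webhooks/deliveries', requireAuth, async (c) => {
  const userId = c.get('userId')
  const { endpointId, status, limit = '50' } = c.req.query()
  // Query values arrive as strings; clamp the page size to a sane range
  const pageSize = Math.min(Math.max(Number(limit) || 50, 1), 200)
  // Select only delivery columns (getTableColumns is exported by drizzle-orm)
  // so the joined endpoint row, including its secret, is never returned
  const deliveries = await db.select(getTableColumns(webhookDeliveries))
    .from(webhookDeliveries)
    .innerJoin(webhookEndpoints, eq(webhookEndpoints.id, webhookDeliveries.endpointId))
    .where(and(
      eq(webhookEndpoints.userId, userId),
      endpointId ? eq(webhookDeliveries.endpointId, endpointId) : undefined,
      status ? eq(webhookDeliveries.status, status) : undefined,
    ))
    .orderBy(desc(webhookDeliveries.createdAt))
    .limit(pageSize)
  return c.json(deliveries)
})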
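// POST /api/webhooks/deliveries/:id/retry
app.post('/api/webhooks/deliveries/:id/retry', requireAuth, async (c) => {
  const userId = c.get('userId')
  // Restrict to the caller's endpoints; otherwise any user could retry any delivery
  const owned = db.select({ id: webhookEndpoints.id }).from(webhookEndpoints).where(eq(webhookEndpoints.userId, userId))
  // Allow manual retry of failed deliveries, with a fresh attempt budget
  const updated = await db.update(webhookDeliveries)
    .set({ status: 'pending', attempts: 0, nextRetryAt: new Date() })
    .where(and(
      eq(webhookDeliveries.id, c.req.param('id')),
      eq(webhookDeliveries.status, 'failed'),
      inArray(webhookDeliveries.endpointId, owned),
    ))
    .returning({ id: webhookDeliveries.id })
  if (updated.length === 0) return c.json({ success: false }, 404)
  return c.json({ success: true })
})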
Testing Your Webhook System
Provide a test mode that sends webhooks to a local endpoint or a testing service like webhook.site. For development, use a tool like ngrok or Cloudflare Tunnel to expose your local server:
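// Test webhook endpoint
app.post('/api/webhooks/test', requireAuth, async (c) => {
  const userId = c.get('userId')
  const { endpointId, eventType } = await c.req.json()
  // Target only the caller's own endpoint. Calling publishEvent here would
  // fan the test event out to every customer subscribed to eventType.
  const endpoint = await db.query.webhookEndpoints.findFirst({
    where: and(eq(webhookEndpoints.id, endpointId), eq(webhookEndpoints.userId, userId)),
  })
  if (!endpoint) return c.json({ success: false, message: 'Endpoint not found' }, 404)
  await db.insert(webhookDeliveries).values({
    endpointId: endpoint.id,
    eventType,
    payload: { test: true, timestamp: new Date().toISOString() },
    nextRetryAt: new Date(),
  })
  return c.json({ success: true, message: 'Test event published' })
})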
A reliable webhook system is the foundation of a trustworthy API platform. Getting it right means your customers can build confidently on your events, knowing that delivery failures are handled gracefully and every event is auditable.
Building a webhook system or adding event-driven features to an existing API? I have built these in production and can help you avoid the common pitfalls. Book a call: calendly.com/jamesrossjr.