Background Jobs in Node.js: Queues, Workers, and Failure Recovery
A complete guide to background job processing in Node.js — BullMQ, job queues, worker processes, priority queues, rate limiting, and the failure recovery patterns that matter in production.

James Ross Jr.
Strategic Systems Architect & Enterprise Software Developer
Every production application eventually needs to do work outside of the request-response cycle. Email sending, PDF generation, image processing, webhook delivery, data imports, report generation — none of these should block a user's request. Background jobs are how you handle them.
The challenge is not adding a job queue — it is building one that behaves correctly when things go wrong: workers crash, jobs fail, the database is temporarily unavailable, or the queue gets backed up. This guide covers the patterns that handle those situations correctly.
Why Queues, Not Just setTimeout
The temptation is to offload work with setTimeout or setImmediate. This breaks in several ways:
- Process restarts lose all in-flight work
- No visibility into job status or failures
- No retry logic for transient failures
- No rate limiting for external API calls
- No concurrency control for resource-intensive operations
A proper job queue stores jobs durably, tracks their state, provides retry logic, and gives you visibility into what is happening.
BullMQ: The Standard Choice
BullMQ (backed by Redis) is my default for Node.js job queues. It is mature, TypeScript-first, and handles the edge cases correctly.
npm install bullmq ioredis
Defining Jobs With Type Safety
// types/jobs.ts
export interface EmailJob {
  to: string
  subject: string
  template: 'welcome' | 'reset-password' | 'invoice' | 'trial-expiry' | 'weekly-digest'
  data: Record<string, unknown>
}

export interface PdfJob {
  reportId: string
  userId: string
  format: 'pdf' | 'xlsx'
}

export interface ImageProcessingJob {
  imageId: string
  operations: Array<{
    type: 'resize' | 'crop' | 'convert'
    params: Record<string, unknown>
  }>
}

export type JobData = {
  email: EmailJob
  pdf: PdfJob
  'image-processing': ImageProcessingJob
}
Setting Up Queues
// queues/index.ts
import { Queue } from 'bullmq'
import { redis } from '../lib/redis'
import type { JobData } from '../types/jobs'
function createQueue<K extends keyof JobData>(name: K) {
  return new Queue<JobData[K]>(name, {
    connection: redis,
    defaultJobOptions: {
      attempts: 3,
      backoff: {
        type: 'exponential',
        delay: 2000, // Start at 2s, then 4s, 8s
      },
      removeOnComplete: { count: 100 }, // Keep last 100 completed
      removeOnFail: { count: 500 }, // Keep last 500 failed for debugging
    },
  })
}
export const emailQueue = createQueue('email')
export const pdfQueue = createQueue('pdf')
export const imageQueue = createQueue('image-processing')
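With type: 'exponential', BullMQ waits delay * 2^(n - 1) milliseconds before retry n. A quick sketch of that schedule, useful when choosing attempts and delay so the final retry lands inside your tolerance window:

```typescript
// Sketch: the delay BullMQ's built-in exponential backoff waits
// before retry attempt n (1-indexed), for a given base delay.
export function exponentialBackoffDelay(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** (attempt - 1)
}

// With the defaults above (delay: 2000):
// attempt 1 → 2000ms, attempt 2 → 4000ms, attempt 3 → 8000ms
```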
Adding Jobs
// In your API handlers
await emailQueue.add('send-welcome', {
  to: user.email,
  subject: 'Welcome to the platform',
  template: 'welcome',
  data: { name: user.name, activationUrl: `https://app.com/activate/${token}` },
})

// Priority jobs (lower number = higher priority)
await emailQueue.add(
  'send-password-reset',
  {
    to: user.email,
    subject: 'Reset your password',
    template: 'reset-password',
    data: { resetUrl: `https://app.com/reset/${token}` },
  },
  { priority: 1 } // Process before normal priority jobs
)

// Delayed jobs
await emailQueue.add(
  'send-trial-expiry-warning',
  { to: user.email, subject: 'Your trial is expiring', template: 'trial-expiry', data: {} },
  { delay: 7 * 24 * 60 * 60 * 1000 } // 7 days from now
)

// Scheduled recurring jobs
await emailQueue.add(
  'weekly-digest',
  { to: user.email, subject: 'Your weekly digest', template: 'weekly-digest', data: {} },
  { repeat: { pattern: '0 9 * * 1' } } // Every Monday at 9am
)
Worker Implementation
// workers/email.ts
import { Worker } from 'bullmq'
import { redis } from '../lib/redis'
import type { EmailJob } from '../types/jobs'

const emailWorker = new Worker<EmailJob>(
  'email',
  async (job) => {
    const { to, subject, template, data } = job.data
    job.log(`Sending ${template} email to ${to}`)
    await job.updateProgress(10)

    // Render the email template
    const html = await renderTemplate(template, data)
    await job.updateProgress(40)

    // Send via your email provider
    await sendEmail({ to, subject, html })
    await job.updateProgress(100)

    return { sentAt: new Date().toISOString() }
  },
  {
    connection: redis,
    concurrency: 10, // Process 10 emails simultaneously
    limiter: {
      max: 100, // Max 100 jobs per interval
      duration: 1000, // Per second
    },
  }
)

// Handle worker events
emailWorker.on('completed', (job) => {
  console.log(`Job ${job.id} completed`)
})

emailWorker.on('failed', (job, err) => {
  console.error(`Job ${job?.id} failed:`, err.message)
  // Alert on final failure (all retries exhausted)
  if ((job?.attemptsMade ?? 0) >= (job?.opts.attempts ?? 1)) {
    console.error(`Job permanently failed after ${job?.attemptsMade} attempts`)
    // Send alert to your monitoring system
  }
})

emailWorker.on('error', (err) => {
  console.error('Worker error:', err)
})
Non-Retryable Errors
Some failures should not be retried. If an email address is permanently invalid or a user does not exist, retrying wastes resources and clutters your failed job logs.
import { UnrecoverableError } from 'bullmq'

async (job) => {
  const user = await db.user.findUnique({ where: { id: job.data.userId } })
  if (!user) {
    // Throw UnrecoverableError to skip retries
    throw new UnrecoverableError(`User ${job.data.userId} not found`)
  }
  if (user.emailBounced) {
    throw new UnrecoverableError(`Email bounced for user ${user.email}`)
  }
  // ... proceed with sending
}
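A common way to decide between the two error classes is to classify the provider's response before throwing. As a sketch — the status-code boundaries here are an assumption, so check your provider's documentation:

```typescript
// Hypothetical classifier: transient failures (rate limits, server
// errors) are worth retrying; client errors (bad address, bad payload)
// will fail identically on every attempt.
export function isTransientStatus(status: number): boolean {
  if (status === 429) return true // rate limited — retry later
  return status >= 500 && status < 600 // provider-side failure
}

// In a processor:
// if (!isTransientStatus(res.status)) {
//   throw new UnrecoverableError(`Provider rejected request: ${res.status}`)
// }
// throw new Error(`Transient failure: ${res.status}`) // will be retried
```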
Job Progress and Logging
Progress tracking gives you visibility into long-running jobs:
async (job) => {
  const rows = await db.select().from(users).where(eq(users.status, 'active'))
  const total = rows.length
  for (let i = 0; i < rows.length; i++) {
    await processUser(rows[i])
    // Update progress (i + 1 so the final iteration reports 100%)
    await job.updateProgress(Math.round(((i + 1) / total) * 100))
    // Log to the job's log (visible in Bull Board)
    if (i % 100 === 0) {
      job.log(`Processed ${i}/${total} users`)
    }
  }
}
Priority Queues
For applications with multiple job types competing for worker resources, use priority:
const reportQueue = new Queue('reports', {
  connection: redis,
  defaultJobOptions: { priority: 10 }, // Default priority
})

// VIP customer report: high priority
await reportQueue.add(
  'generate-report',
  { customerId, reportType },
  { priority: 1 } // Lower number = higher priority
)

// Background analytics: low priority
await reportQueue.add(
  'generate-analytics',
  { period: 'monthly' },
  { priority: 100 }
)
BullMQ Flow: Job Chains and Pipelines
For multi-step workflows where jobs depend on each other:
import { FlowProducer } from 'bullmq'

const flow = new FlowProducer({ connection: redis })

// Create a data import pipeline. Children run before their parent,
// so the deepest job (validation) executes first and the root
// (report generation) executes last.
await flow.add({
  name: 'generate-report',
  queueName: 'reporting',
  data: { fileId },
  children: [
    {
      name: 'process-data',
      queueName: 'processing',
      data: { fileId },
      children: [
        {
          name: 'validate-and-import',
          queueName: 'validation',
          data: { fileId },
        },
      ],
    },
  ],
})
Child jobs run first. Parent jobs wait for all children to complete. If a child fails, the parent is not started.
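Parents can also consume their children's return values via BullMQ's job.getChildrenValues(), which resolves to a map of child job keys to whatever each child returned. A sketch of a parent processor doing that — the minimal ParentJob interface stands in for BullMQ's full Job type so the example is self-contained, and the aggregation logic is illustrative:

```typescript
// Minimal shape of the Job API this processor uses, standing in
// for BullMQ's Job type.
interface ParentJob {
  getChildrenValues(): Promise<Record<string, unknown>>
}

// Processor for a parent step: aggregate whatever the child
// jobs returned when they completed.
export async function generateReportProcessor(job: ParentJob) {
  const childValues = await job.getChildrenValues()
  const batches = Object.values(childValues)
  return { batchCount: batches.length }
}

// Attach it with: new Worker('reporting', generateReportProcessor, { connection: redis })
```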
Bull Board: Monitoring Dashboard
Install Bull Board for a visual dashboard of your queues:
import { createBullBoard } from '@bull-board/api'
import { BullMQAdapter } from '@bull-board/api/bullMQAdapter'
import { HonoAdapter } from '@bull-board/hono'
import { serveStatic } from '@hono/node-server/serve-static'

const serverAdapter = new HonoAdapter(serveStatic)

createBullBoard({
  queues: [
    new BullMQAdapter(emailQueue),
    new BullMQAdapter(pdfQueue),
    new BullMQAdapter(imageQueue),
  ],
  serverAdapter,
})

serverAdapter.setBasePath('/admin/queues')
app.route('/admin/queues', serverAdapter.registerPlugin())
Protect this route with admin authentication. The dashboard shows queue depth, job throughput, failure rates, and lets you manually retry or delete jobs.
Graceful Shutdown
Workers should finish in-progress jobs before shutting down:
async function shutdown() {
  console.log('Shutting down workers...')
  await emailWorker.close()
  await pdfWorker.close()
  console.log('Workers stopped gracefully')
  process.exit(0)
}

process.on('SIGTERM', shutdown)
process.on('SIGINT', shutdown)
worker.close() stops accepting new jobs and waits for current jobs to complete before returning.
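One refinement, as a sketch: if a job hangs, close() can wait indefinitely, and your process supervisor will eventually SIGKILL the worker mid-job anyway. Racing the close against a deadline lets you exit deliberately instead — the 30-second figure below is an assumption; match it to your platform's termination grace period:

```typescript
// Race a promise against a deadline; resolves to 'timeout' if the
// deadline wins. Used to bound how long shutdown waits on close().
export async function withDeadline<T>(
  promise: Promise<T>,
  ms: number
): Promise<T | 'timeout'> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const deadline = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), ms)
  })
  try {
    return await Promise.race([promise, deadline])
  } finally {
    clearTimeout(timer)
  }
}

// In shutdown():
// const result = await withDeadline(
//   Promise.all([emailWorker.close(), pdfWorker.close()]),
//   30_000
// )
// process.exit(result === 'timeout' ? 1 : 0)
```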
Deployment Considerations
Run workers as separate processes from your API server. This allows:
- Independent scaling (more workers for high-volume queues)
- Separate restarts (worker crash does not affect API)
- Per-worker resource configuration (more memory for image processing workers)
In Docker, a separate service per worker type:
# docker-compose.yml
services:
  api:
    build: .
    command: node dist/api.js
  email-worker:
    build: .
    command: node dist/workers/email.js
    scale: 2 # Two instances for redundancy
  pdf-worker:
    build: .
    command: node dist/workers/pdf.js
    environment:
      - WORKER_CONCURRENCY=2 # PDF is memory-intensive
Background jobs are infrastructure you build once and rely on continuously. Design them with failure in mind from the start and you will sleep better when things inevitably go wrong.
Designing a background job architecture or migrating from a brittle in-process approach to a proper queue? I can help design a system that scales. Book a call: calendly.com/jamesrossjr.