Building Scalable AI Pipelines with Kubernetes and Ray
Learn how to create production-ready machine learning pipelines that can scale from a single GPU to thousands of nodes seamlessly.

James Ross Jr.
Full-Stack Developer & Systems Architect
The Challenge of Scaling AI Workloads
In 2024, AI workloads have become increasingly complex. We're no longer talking about training simple models on a single GPU—we're dealing with massive language models, computer vision pipelines processing millions of images, and real-time inference systems serving billions of predictions daily. When I first attempted to scale our ML infrastructure at my previous company, we hit a wall at around 100 concurrent training jobs. The solution? A combination of Ray and Kubernetes that now handles over 10,000 concurrent jobs seamlessly.
The fundamental challenge isn't just about throwing more hardware at the problem. It's about efficiently orchestrating resources, managing dependencies, handling failures gracefully, and ensuring reproducibility across thousands of experiments. Traditional approaches using simple job schedulers or manual cluster management quickly become untenable. This is where Ray and Kubernetes shine—together, they provide a powerful foundation for building production-grade AI infrastructure.
Understanding Ray: The AI Computing Framework
Ray is fundamentally different from traditional distributed computing frameworks. While Apache Spark excels at data processing and Dask handles array computations, Ray was built from the ground up for AI workloads. It provides a simple Python API that lets you scale from a laptop to a cluster without changing your code.
At its core, Ray provides three key abstractions:
- Tasks: Stateless functions that can be executed remotely. Perfect for data preprocessing, batch inference, or any embarrassingly parallel workload.
- Actors: Stateful workers that maintain their own memory. Essential for model serving, maintaining connection pools, or implementing parameter servers.
- Objects: Immutable values stored in Ray's distributed shared memory. This allows efficient data sharing without serialization overhead.
Here's a simple example that demonstrates Ray's power:
import ray
import numpy as np
from transformers import pipeline
@ray.remote
class ModelServer:
def __init__(self, model_name):
self.model = pipeline("sentiment-analysis", model=model_name)
def predict(self, texts):
return self.model(texts)
# Initialize Ray cluster
ray.init(address="ray://ray-head:10001")
# Deploy 10 model replicas across the cluster
servers = [ModelServer.remote("distilbert-base") for _ in range(10)]
# Process 100,000 texts in parallel
texts = load_texts() # Your data loading logic
chunk_size = 1000
futures = []
for i in range(0, len(texts), chunk_size):
server = servers[i % len(servers)]
chunk = texts[i:i+chunk_size]
futures.append(server.predict.remote(chunk))
# Gather results
results = ray.get(futures)
This code automatically handles load balancing, fault tolerance, and resource management. If a node fails, Ray automatically reschedules the work. If you need more throughput, just add more replicas.
Kubernetes: The Orchestration Layer
While Ray handles the AI workload orchestration, Kubernetes manages the underlying infrastructure. It provides automatic scaling, self-healing, service discovery, and declarative configuration. The combination is powerful: Kubernetes ensures your Ray cluster is always healthy and properly sized, while Ray ensures your AI workloads run efficiently.
A production Ray-on-Kubernetes deployment typically consists of:
- Ray Head Node: The cluster coordinator, running as a Kubernetes StatefulSet for stability
- Ray Worker Nodes: The compute nodes, deployed as a Deployment or StatefulSet with autoscaling
- Shared Storage: NFS, EFS, or object storage for dataset and checkpoint management
- Monitoring Stack: Prometheus, Grafana, and Ray Dashboard for observability
- Ingress Controller: For exposing Ray Dashboard and serving endpoints
Architecture Deep Dive: Building for Scale
Let me walk you through the architecture we implemented that scales to thousands of GPUs. The key insight was separating concerns into distinct layers, each optimized for its specific role.
Layer 1: Infrastructure Foundation
At the base, we have Kubernetes managing the raw compute resources. We use node pools with different characteristics:
- CPU Pool: For data preprocessing, feature engineering, and lightweight tasks
- GPU Pool: For training and inference, with A100s for large models and T4s for inference
- Memory-Optimized Pool: For data loading and caching, with 512GB+ RAM per node
- Spot/Preemptible Pool: For fault-tolerant batch workloads, saving 60-80% on compute costs
Layer 2: Ray Cluster Management
We deploy Ray using the KubeRay operator, which provides CRDs (Custom Resource Definitions) for managing Ray clusters declaratively:
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
name: ai-pipeline-cluster
spec:
rayVersion: '2.9.0'
headGroupSpec:
serviceType: ClusterIP
replicas: 1
rayStartParams:
dashboard-host: '0.0.0.0'
object-store-memory: '2147483648'
template:
spec:
containers:
- name: ray-head
image: rayproject/ray-ml:2.9.0-gpu
resources:
limits:
cpu: 8
memory: 32Gi
requests:
cpu: 4
memory: 16Gi
volumeMounts:
- mountPath: /shared
name: shared-storage
volumes:
- name: shared-storage
persistentVolumeClaim:
claimName: ray-shared-storage
workerGroupSpecs:
- replicas: 10
minReplicas: 2
maxReplicas: 100
groupName: gpu-workers
rayStartParams: {}
template:
spec:
containers:
- name: ray-worker
image: rayproject/ray-ml:2.9.0-gpu
resources:
limits:
nvidia.com/gpu: 1
cpu: 16
memory: 64Gi
requests:
nvidia.com/gpu: 1
cpu: 8
memory: 32Gi
Layer 3: Workload Orchestration
On top of the Ray cluster, we implement different patterns for different workload types:
Pattern 1: Distributed Training with Ray Train
from ray import train
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig, RunConfig, CheckpointConfig
def train_func(config):
# Your PyTorch training loop
model = create_model(config["model_size"])
optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
for epoch in range(config["epochs"]):
for batch in train_dataloader:
loss = train_step(model, batch, optimizer)
train.report({"loss": loss, "epoch": epoch})
if epoch % 5 == 0:
train.report({}, checkpoint=create_checkpoint(model))
trainer = TorchTrainer(
train_func,
train_loop_config={
"model_size": "7B",
"lr": 1e-4,
"epochs": 100
},
scaling_config=ScalingConfig(
num_workers=16,
use_gpu=True,
resources_per_worker={"GPU": 1, "CPU": 8}
),
run_config=RunConfig(
name="llm-finetuning",
storage_path="s3://my-bucket/experiments",
checkpoint_config=CheckpointConfig(
num_to_keep=3,
checkpoint_score_attribute="loss",
checkpoint_score_order="min"
)
)
)
result = trainer.fit()
Pattern 2: Hyperparameter Tuning with Ray Tune
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.hyperopt import HyperOptSearch
search_space = {
"lr": tune.loguniform(1e-6, 1e-2),
"batch_size": tune.choice([16, 32, 64, 128]),
"dropout": tune.uniform(0.1, 0.5),
"num_layers": tune.randint(4, 12),
"hidden_size": tune.choice([512, 768, 1024, 2048])
}
scheduler = ASHAScheduler(
max_t=100,
grace_period=10,
reduction_factor=2
)
search_alg = HyperOptSearch(
metric="validation_loss",
mode="min"
)
tuner = tune.Tuner(
trainable=train_func,
param_space=search_space,
tune_config=tune.TuneConfig(
num_samples=1000,
scheduler=scheduler,
search_alg=search_alg,
max_concurrent_trials=50
),
run_config=RunConfig(
name="hyperopt-search",
storage_path="s3://my-bucket/hpo"
)
)
results = tuner.fit()
Production Deployment: Real-World Implementation
Let me share the complete setup we use in production, handling 50+ TB of data daily and training hundreds of models concurrently.
Step 1: Infrastructure Setup
First, provision your Kubernetes cluster with appropriate node pools:
#!/bin/bash
# EKS cluster creation with mixed instance types
eksctl create cluster \
--name ai-pipeline-cluster \
--region us-west-2 \
--version 1.28 \
--nodegroup-name cpu-nodes \
--node-type m6i.4xlarge \
--nodes 5 \
--nodes-min 2 \
--nodes-max 20
# Add GPU node group
eksctl create nodegroup \
--cluster ai-pipeline-cluster \
--name gpu-nodes \
--node-type g5.12xlarge \
--nodes 3 \
--nodes-min 1 \
--nodes-max 10 \
--node-labels workload=gpu \
--node-taints nvidia.com/gpu=true:NoSchedule
# Add spot instance node group for cost optimization
eksctl create nodegroup \
--cluster ai-pipeline-cluster \
--name spot-nodes \
--spot \
--instance-types m5.4xlarge,m5a.4xlarge,m5n.4xlarge \
--nodes 10 \
--nodes-min 5 \
--nodes-max 50
Step 2: Storage Configuration
Set up high-performance storage for datasets and checkpoints:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ray-shared-storage
spec:
accessModes:
- ReadWriteMany
storageClassName: efs-sc
resources:
requests:
storage: 10Ti
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "16000"
throughput: "1000"
volumeBindingMode: WaitForFirstConsumer
Step 3: Deploy Ray with KubeRay
# Install KubeRay operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator \
--namespace ray-system \
--create-namespace \
--set image.tag=v1.0.0
# Deploy Ray cluster
kubectl apply -f ray-cluster.yaml
Step 4: Monitoring and Observability
Deploy comprehensive monitoring to track cluster health and job performance:
apiVersion: v1
kind: ConfigMap
metadata:
name: ray-prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'ray-metrics'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- ray-system
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_ray_io_node_type]
action: keep
regex: head|worker
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?
replacement: $1:8080
Advanced Patterns and Optimizations
After running this architecture for two years, we've discovered several patterns that significantly improve performance and reduce costs.
Pattern 1: Heterogeneous Scheduling
Not all tasks need GPUs. We use Ray's placement groups to ensure optimal resource utilization:
import ray
from ray.util.placement_group import placement_group
# Create placement group for mixed workload
pg = placement_group([
{"CPU": 4}, # Data loading
{"CPU": 8, "GPU": 1}, # Model training
{"CPU": 2}, # Metrics aggregation
])
@ray.remote(num_cpus=4, placement_group=pg,
placement_group_bundle_index=0)
def load_data():
# CPU-intensive data preprocessing
pass
@ray.remote(num_cpus=8, num_gpus=1,
placement_group=pg,
placement_group_bundle_index=1)
def train_model(data):
# GPU-accelerated training
pass
@ray.remote(num_cpus=2, placement_group=pg,
placement_group_bundle_index=2)
def aggregate_metrics(results):
# Light CPU work
pass
Pattern 2: Efficient Data Loading
Data loading often becomes the bottleneck. We use Ray Datasets for efficient distributed data processing:
import ray.data
# Create dataset from S3
ds = ray.data.read_parquet("s3://bucket/training-data/")
# Distributed preprocessing pipeline
ds = ds.map_batches(
preprocess_batch,
batch_size=1024,
num_gpus=0.1, # Fractional GPU for data augmentation
num_cpus=2
)
# Stream to training
ds = ds.window(blocks_per_window=10)
for batch in ds.iter_batches(batch_size=256):
train_step(model, batch)
Pattern 3: Fault Tolerance and Checkpointing
In distributed training, failures are inevitable. We implement comprehensive fault tolerance:
from ray.train.torch import TorchCheckpoint
import torch
class FaultTolerantTrainer:
def __init__(self, checkpoint_dir="s3://bucket/checkpoints"):
self.checkpoint_dir = checkpoint_dir
@ray.remote
def train_with_recovery(self, config):
# Try to restore from checkpoint
checkpoint = self.load_latest_checkpoint()
if checkpoint:
model = checkpoint["model"]
optimizer = checkpoint["optimizer"]
start_epoch = checkpoint["epoch"]
print(f"Resuming from epoch {start_epoch}")
else:
model = create_model(config)
optimizer = create_optimizer(model, config)
start_epoch = 0
for epoch in range(start_epoch, config["max_epochs"]):
try:
for batch in get_data_loader():
loss = train_step(model, batch, optimizer)
# Checkpoint every N epochs
if epoch % 5 == 0:
self.save_checkpoint({
"model": model.state_dict(),
"optimizer": optimizer.state_dict(),
"epoch": epoch,
"config": config
})
except Exception as e:
print(f"Error at epoch {epoch}: {e}")
# Ray will automatically retry the task
raise
Performance Tuning and Optimization
Getting optimal performance requires careful tuning. Here are the key metrics and optimizations that made the biggest difference in our deployment.
GPU Utilization Optimization
We initially saw only 40% GPU utilization. After optimization, we consistently achieve 85-95%:
- Data Pipeline: Implement prefetching and parallel data loading to ensure GPUs never wait for data
- Mixed Precision: Use FP16/BF16 training to double throughput on modern GPUs
- Gradient Accumulation: Simulate larger batch sizes without increasing memory usage
- Multi-GPU Strategies: Choose between DDP, FSDP, and DeepSpeed based on model size
Cost Optimization
Our optimizations reduced costs by 73% while improving throughput:
- Spot Instances: Use for fault-tolerant workloads with 60-80% cost savings
- Reserved Instances: For baseline capacity with 40% savings
- Autoscaling: Scale down during off-peak hours
- Resource Packing: Optimize container resource requests to maximize node utilization
Debugging and Troubleshooting
When things go wrong (and they will), having proper debugging tools is essential. Here's our debugging toolkit:
Ray Dashboard
Access the Ray dashboard for real-time cluster monitoring:
kubectl port-forward svc/ray-head-svc 8265:8265
# Access at http://localhost:8265
Distributed Debugging
Use Ray's built-in debugger for distributed applications:
import ray
from ray import debug
ray.init()
debug.set_trace() # Breakpoint in distributed code
@ray.remote
def problematic_function(x):
debug.set_trace() # Remote breakpoint
result = complex_computation(x)
return result
Performance Profiling
Profile distributed workloads to identify bottlenecks:
from ray.profiling import profile
@ray.remote
@profile(name="data_processing")
def process_batch(batch):
# Your processing logic
pass
# View profiling results in Ray Dashboard
Security and Compliance
Production AI pipelines must be secure. We implement defense in depth:
- Network Policies: Restrict pod-to-pod communication using Kubernetes NetworkPolicies
- RBAC: Fine-grained access control for different teams and workloads
- Secrets Management: Use Kubernetes Secrets with encryption at rest for API keys and credentials
- Image Scanning: Scan all container images for vulnerabilities before deployment
- Audit Logging: Complete audit trail of all API calls and data access
Real-World Case Studies
Let me share three real implementations that showcase the power of this architecture.
Case Study 1: Large Language Model Training
We fine-tuned a 70B parameter model across 64 A100 GPUs:
- Challenge: Model doesn't fit on a single GPU
- Solution: Used FSDP (Fully Sharded Data Parallel) with Ray Train
- Result: 12x speedup compared to single-node training, completed in 48 hours instead of 3 weeks
Case Study 2: Computer Vision Pipeline
Processed 100 million images for object detection and classification:
- Challenge: Heterogeneous workload with different resource requirements
- Solution: Ray Workflows for orchestration with mixed CPU/GPU processing
- Result: Processed entire dataset in 72 hours using spot instances, 80% cost reduction
Case Study 3: Real-Time Inference
Served 1 billion predictions daily with <100ms p99 latency:
- Challenge: Variable traffic patterns with 10x peak loads
- Solution: Ray Serve with automatic scaling based on request queue depth
- Result: Maintained SLA during Black Friday traffic spike, zero downtime in 18 months
Future Directions and Emerging Trends
The landscape of AI infrastructure is rapidly evolving. Here's what we're exploring for 2025 and beyond:
- Confidential Computing: Training on encrypted data using Intel SGX and AMD SEV
- Federated Learning: Distributed training across edge devices without centralizing data
- Quantum-Classical Hybrid: Integrating quantum processors for specific optimization tasks
- Neuromorphic Hardware: Exploring brain-inspired computing for ultra-low power inference
- Sustainable AI: Carbon-aware scheduling to minimize environmental impact
Lessons Learned and Best Practices
After two years of running this architecture in production, here are the key lessons:
- Start Simple: Begin with a small cluster and scale gradually. Complexity is the enemy of reliability.
- Monitor Everything: You can't optimize what you can't measure. Instrument every component.
- Plan for Failure: Assume everything will fail. Design for resilience from day one.
- Optimize Iteratively: Don't try to optimize everything at once. Focus on the biggest bottlenecks first.
- Invest in Automation: Manual processes don't scale. Automate deployment, monitoring, and recovery.
- Document Extensively: Your future self (and team) will thank you for comprehensive documentation.
- Stay Current: The ecosystem evolves rapidly. Regularly evaluate new tools and techniques.
Conclusion: The Path Forward
Building scalable AI pipelines with Kubernetes and Ray isn't just about technology—it's about enabling innovation. When data scientists can iterate quickly without worrying about infrastructure, when models can scale from prototype to production seamlessly, and when costs are optimized automatically, that's when real breakthroughs happen.
The architecture we've discussed here powers some of the largest AI deployments in production today. It's battle-tested, scalable, and continuously evolving. Whether you're processing thousands or billions of data points, training small models or LLMs, serving batch predictions or real-time inference, this foundation will serve you well.
Remember, the goal isn't to build the most complex system—it's to build the right system for your needs. Start with the basics, measure everything, and scale based on actual requirements rather than hypothetical scenarios. The combination of Ray and Kubernetes provides the flexibility to grow from a single GPU to thousands, from megabytes to petabytes, from prototype to production.
The future of AI is distributed, and with the right architecture, you're ready to build it.