The Anthropic Claude API: A Developer's Guide to Building With It
A practical developer's guide to building with the Anthropic Claude API — authentication, model selection, tool use, streaming, prompt caching, and production deployment patterns.

James Ross Jr.
Strategic Systems Architect & Enterprise Software Developer
Why I Build on Claude
Before getting into the technical guide, I want to be transparent about my tooling choices. I build on the Anthropic Claude API as my primary LLM platform for AI applications. That's a deliberate choice, not a default.
The reasons: Claude's performance on complex reasoning and instruction-following tasks is excellent for the enterprise software work I do. The context window is large enough to handle substantial codebases and documents. The structured output capabilities are production-grade. And the API design is clean — the Anthropic SDK is one of the better-designed AI client libraries available.
That said, this guide is about how to build with the Claude API effectively. The patterns apply broadly and I'll note where you'd adapt them for other providers.
Getting Started: Authentication and SDK Setup
The Anthropic API uses API key authentication. I work primarily in TypeScript, so the examples here use the TypeScript SDK; a Python SDK with equivalent capabilities is also available.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});
The key should live in environment variables, never hardcoded. In production, use your secrets management system (AWS Secrets Manager, Doppler, whatever your infrastructure uses). In development, a .env file with dotenv is fine.
One pattern I enforce in every project: the Anthropic client is initialized exactly once in a shared module and imported wherever needed. Creating new client instances per request is wasteful and creates connection management overhead.
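That shared-module pattern can be sketched as a lazy singleton. The factory below is a dummy object so the snippet stays self-contained; in a real project it would be `() => new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })`, exported from a single module that every call site imports.

```typescript
// Lazy singleton helper: the factory runs at most once, and every caller
// receives the same instance.
function lazySingleton<T>(factory: () => T): () => T {
  let instance: T | undefined;
  return () => {
    if (instance === undefined) instance = factory();
    return instance;
  };
}

// Dummy factory standing in for `new Anthropic({ ... })`.
const getClient = lazySingleton(() => ({ created: Date.now() }));
```

Every import site calls `getClient()` and gets the same underlying client, so connection pooling and configuration live in one place.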
Model Selection: The Right Model for the Task
Anthropic offers a model family with different capability and cost profiles. The selection decision matters for both quality and cost.
Claude Opus is the most capable model for complex reasoning — nuanced analysis, multi-step problem solving, tasks that require careful judgment. It's also the most expensive per token. I use it for the tasks where quality matters most: code architecture review, complex document analysis, high-stakes content generation.
Claude Sonnet is the model I use most in production applications. It delivers strong performance on a wide range of tasks at a significantly lower cost than Opus. For the majority of AI application tasks — document processing, code generation, structured data extraction, conversational interfaces — Sonnet is the right default.
Claude Haiku is optimized for speed and cost. I use it for high-volume, lower-complexity tasks: classification, simple extraction, real-time features where latency matters more than maximum quality. The cost per token is dramatically lower, which matters at scale.
The practical pattern: define task types in your application and assign model tiers to them. Route requests to the appropriate model based on task type. This multi-tier approach is one of the most impactful cost optimizations in AI application development.
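A minimal sketch of that routing table — the task names are hypothetical, and the model IDs are illustrative (check Anthropic's model documentation for current IDs):

```typescript
type TaskType = "architecture_review" | "document_processing" | "classification";

// Map each task type to a model tier; IDs here are illustrative.
const MODEL_FOR_TASK: Record<TaskType, string> = {
  architecture_review: "claude-opus-4-1",   // Opus: quality-critical work
  document_processing: "claude-sonnet-4-6", // Sonnet: the production default
  classification: "claude-haiku-4-5",       // Haiku: high volume, low cost
};

function modelFor(task: TaskType): string {
  return MODEL_FOR_TASK[task];
}
```

Centralizing the mapping means a model upgrade or a cost rebalance is a one-line change rather than a hunt through every call site.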
The Core API: Messages
The messages API is the foundation of Claude API usage. The key concepts:
System prompt: The instruction context that shapes how Claude responds throughout the conversation. This is where you define role, constraints, output format requirements, and context about the application. Invest heavily in your system prompts.
Messages array: The conversation history. Each message has a role (user or assistant) and content. For multi-turn conversations, include the full history. For single-turn requests, a single user message is sufficient.
Structured outputs: For production applications, always use structured outputs when you need reliable response formats. Define a JSON schema and constrain the response to it — with the Claude API, a reliable approach is defining a tool whose input schema is your desired output shape and forcing its use via tool_choice.
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: "You are a document classification system. Classify documents into categories.",
  messages: [
    {
      role: "user",
      content: `Classify this document: ${document}`,
    },
  ],
});
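One common way to get schema-conforming JSON back is to define a tool whose input schema is the output shape you want, then force that tool with tool_choice. A sketch of the request body — built but not sent here; the `classify_document` tool name and the category values are hypothetical:

```typescript
// Request body sketch: forcing a tool call so the model must emit JSON
// matching the tool's input schema.
const classificationRequest = {
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools: [
    {
      name: "classify_document",
      description: "Record the classification of a document.",
      input_schema: {
        type: "object",
        properties: {
          category: { type: "string", enum: ["invoice", "contract", "report"] },
          confidence: { type: "number" },
        },
        required: ["category"],
      },
    },
  ],
  // Force the model to call this specific tool.
  tool_choice: { type: "tool", name: "classify_document" },
  messages: [{ role: "user", content: "Classify this document: ..." }],
};
```

The response's tool_use block then carries an input object that conforms to the schema, which your application can parse directly instead of scraping JSON out of free text.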
Tool Use: Building Agentic Capabilities
Tool use (also called function calling) is the capability that enables agentic applications — where Claude can take actions, not just generate text. You define tools as functions with JSON schemas, provide them to the model, and Claude decides when and how to call them.
The pattern: define your tools with clear names, descriptions, and parameter schemas. Claude uses the name and description to decide when to call the tool, and the parameter schema to know what to pass. Good tool descriptions are as important as good prompts.
const tools = [
  {
    name: "search_knowledge_base",
    description:
      "Search the company knowledge base for relevant documentation. Use this when the user asks about company policies, procedures, or product information.",
    input_schema: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "The search query",
        },
        max_results: {
          type: "number",
          description: "Maximum number of results to return (1-10)",
        },
      },
      required: ["query"],
    },
  },
];
When Claude decides to call a tool, it returns a tool_use content block with the tool name and input. Your application executes the actual function and returns the result in a tool_result content block within a user message. Claude then continues its response, incorporating the result.
This loop — model decides to call tool, application executes, result returned to model — is the fundamental pattern for agentic applications. Multiple tool calls can happen in sequence or in parallel, building up the information needed to complete a complex task.
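The application side of that loop can be sketched as a dispatch table: given a tool_use block, run the matching function and wrap its output in a tool_result block. The knowledge-base implementation below is a hypothetical stub, and real tool functions would usually be async:

```typescript
type ToolUseBlock = {
  type: "tool_use";
  id: string;
  name: string;
  input: Record<string, unknown>;
};

// Hypothetical stub implementations keyed by tool name.
const toolImplementations: Record<string, (input: Record<string, unknown>) => string> = {
  search_knowledge_base: (input) => `Results for "${String(input.query)}"`,
};

// Execute the requested tool and build the message the model expects back:
// a user-role message containing a tool_result block tied to the call's id.
function executeTool(block: ToolUseBlock) {
  const impl = toolImplementations[block.name];
  const content = impl ? impl(block.input) : `Error: unknown tool ${block.name}`;
  return {
    role: "user" as const,
    content: [{ type: "tool_result" as const, tool_use_id: block.id, content }],
  };
}
```

Returning an error string for an unknown tool (rather than throwing) lets the model see the failure and recover in its next turn.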
Streaming: The User Experience Imperative
For user-facing AI features, streaming is mandatory. Waiting for a complete response before showing anything creates a poor user experience — users see nothing for 3-10 seconds, then a wall of text appears.
Streaming returns tokens as they're generated, allowing your UI to display content progressively. The difference in perceived performance is significant.
const stream = await client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  messages: [{ role: "user", content: userMessage }],
});

for await (const chunk of stream) {
  if (
    chunk.type === "content_block_delta" &&
    chunk.delta.type === "text_delta"
  ) {
    process.stdout.write(chunk.delta.text);
  }
}
In a web application, you'd stream these tokens to the client over a Server-Sent Events connection or a WebSocket. The client appends each token to the displayed content as it arrives.
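The server-side plumbing for SSE is mostly framing: each token is written as a `data:` message followed by a blank line. A minimal sketch of the frame formatter — wrapping the token in a JSON payload is an assumption of this sketch, not an SSE requirement:

```typescript
// Format one streamed token as a Server-Sent Events frame. JSON-encoding
// the token preserves newlines, which would otherwise break SSE framing
// (a raw newline inside a data line terminates the message).
function sseFrame(token: string): string {
  return `data: ${JSON.stringify({ token })}\n\n`;
}
```

In a request handler you would set `Content-Type: text/event-stream` and call `res.write(sseFrame(chunk.delta.text))` inside the streaming loop above; the client parses each frame and appends the token to the displayed text.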
Prompt Caching: The Cost Optimization You Should Use
Prompt caching is a capability that reduces costs significantly for applications with large, stable system prompts or repeated context. When you mark content as cacheable, Anthropic stores the processed representation of that content and reuses it across requests, charging a reduced rate for cache hits.
The use cases where caching creates meaningful savings: applications with large system prompts that don't change per request, RAG applications that include the same reference documents in many requests, applications that process a large document many times with different questions.
Implementing caching requires marking content blocks with cache_control: { type: "ephemeral" }. The cache is maintained for up to 5 minutes by default, with extended options available. On a sufficiently large prompt with high request volume, caching can reduce costs by 70-90% on the cached portion.
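A sketch of what that looks like on a request with a large, stable system prompt. The system field accepts an array of content blocks, and cache_control on a block caches the prefix up to and including that block; the request object is built but not sent here, and the prompt text is a placeholder:

```typescript
// Placeholder for several thousand tokens of stable instructions.
const LARGE_SYSTEM_PROMPT = "...stable instructions, reference docs, examples...";

const cachedRequest = {
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text" as const,
      text: LARGE_SYSTEM_PROMPT,
      // Marks everything up to and including this block as cacheable.
      cache_control: { type: "ephemeral" as const },
    },
  ],
  messages: [{ role: "user" as const, content: "..." }],
};
```

The per-request user message stays outside the cached prefix, so only the varying portion is processed at full price on cache hits.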
Error Handling and Retry Logic
Production API usage requires robust error handling. The Claude API returns structured errors that you should handle explicitly:
- Rate limit errors (429): Implement exponential backoff with jitter. Don't hammer the API on rate limit.
- Server errors (500, 529): Transient; retry with backoff.
- Invalid request errors (400): Usually prompt or parameter issues; don't retry without fixing the request.
- Authentication errors (401): API key issue; don't retry, alert the operations team.
The pattern I use: a retry wrapper around all API calls with classification of retryable vs. non-retryable errors, exponential backoff for retryable errors, dead letter logging for non-retryable errors.
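A sketch of that wrapper, assuming the error objects expose a numeric status field (as the Anthropic SDK's error classes do); the codes classified as retryable follow the list above, and the backoff/attempt numbers are illustrative defaults:

```typescript
// Retryable: rate limits and transient server errors.
function isRetryable(status: number | undefined): boolean {
  return status === 429 || status === 500 || status === 529;
}

// Exponential backoff with jitter: between half and the full exponential
// delay, so concurrent clients don't retry in lockstep.
function backoffMs(attempt: number, baseMs = 500): number {
  const exp = baseMs * 2 ** attempt;
  return exp / 2 + Math.random() * (exp / 2);
}

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Non-retryable errors and exhausted budgets propagate to the caller,
      // where they can be dead-letter logged.
      if (attempt + 1 >= maxAttempts || !isRetryable(err?.status)) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt)));
    }
  }
}
```

Usage is `await withRetry(() => client.messages.create({ ... }))`, keeping the retry policy in one place instead of scattered across call sites.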
Observability in Production
For production applications, every API call should be logged with: the model used, the token counts (input and output), the latency, the result type (success/error), and a correlation ID that links the API call to the user request that triggered it.
This gives you: cost tracking per feature and per user, latency percentile data, error rate monitoring, and the ability to trace AI behavior back to specific user interactions when debugging.
Without this logging, you're operating AI features blind. The cost of adding structured logging is minimal; the value when something goes wrong is significant.
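A sketch of the log record builder. The usage shape (`input_tokens` / `output_tokens`) matches what the API returns on responses; the log field names are illustrative, and the clock is injectable so the function stays testable:

```typescript
type CallLog = {
  correlationId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  outcome: "success" | "error";
};

// Build one structured log record per API call; emit it to whatever
// structured logging backend the application uses.
function buildCallLog(
  correlationId: string,
  model: string,
  usage: { input_tokens: number; output_tokens: number },
  startedAt: number,
  outcome: "success" | "error",
  now: () => number = Date.now,
): CallLog {
  return {
    correlationId,
    model,
    inputTokens: usage.input_tokens,
    outputTokens: usage.output_tokens,
    latencyMs: now() - startedAt,
    outcome,
  };
}
```

With the correlation ID carried from the inbound request, one log query links a user complaint to the exact model call, prompt version, and token spend behind it.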
If you're building a production application on the Claude API and want experienced architecture guidance on integration patterns, cost optimization, and observability, book a conversation at Calendly. I build with this API daily and can help you structure your integration for reliability and cost efficiency.