Building APIs for AI-powered applications introduces constraints and patterns that traditional API design wisdom does not fully address. Streaming responses, long-running inference jobs, prompt management endpoints, and embedding search routes all require intentional design decisions.
This post distills the patterns we have found most durable across the AI applications we have built.
Streaming Is a First-Class Concern
LLM inference is slow relative to traditional API responses. A GPT-4o call for a long output can take 10-30 seconds. Returning this as a single HTTP response gives users a blank screen with a spinner — a terrible experience.
Design your AI endpoints to stream from the start. Use Server-Sent Events (SSE) for simplicity or WebSockets when you need bidirectional communication. Structure your API so streaming is the default, not an afterthought.
// SSE streaming endpoint (Node.js / Express)
// POST rather than GET: GET requests have no body, and req.body.messages
// requires the express.json() middleware to be registered.
app.post('/api/chat/stream', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: req.body.messages,
    stream: true,
  });

  // Forward each token delta to the client as its own SSE event.
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? '';
    if (delta) res.write(`data: ${JSON.stringify({ text: delta })}\n\n`);
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

Async Job Pattern for Long-Running Inference
For tasks that take longer than 30 seconds — document analysis, multi-step agent runs, batch embeddings — design an async job pattern: POST returns a job ID immediately, GET polls for status and result.
This keeps your API responsive, allows clients to retry status checks on reconnect, and gives you a natural place to implement job queuing, priority, and cancellation.
- POST /api/jobs → { jobId, status: 'queued' }
- GET /api/jobs/:id → { status: 'processing' | 'complete' | 'failed', result? }
- DELETE /api/jobs/:id → cancel a running job
- POST /api/jobs/:id/retry → retry a failed job
Versioning Prompts Like Code
Your system prompts and few-shot examples are part of your API contract — they determine the behaviour of your endpoints. Treat them as versioned artefacts stored in your backend, not hardcoded strings in your source code.
Expose a prompt-management API or use a dedicated prompt registry. This lets you update prompt behaviour without a code deployment, A/B test prompt versions, and roll back problematic prompt changes instantly.
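A minimal sketch of what such a registry could look like, with hypothetical function names and an in-memory store standing in for your backend. Versions are immutable and "active" is just a pointer, which is what makes rollback instant.

```javascript
// Hypothetical versioned prompt registry (in-memory sketch).
const registry = new Map(); // name -> { versions: [...], activeIndex }

function publishPrompt(name, template) {
  const entry = registry.get(name) ?? { versions: [], activeIndex: -1 };
  // Versions are append-only; publishing makes the new version active.
  entry.versions.push({ version: entry.versions.length + 1, template });
  entry.activeIndex = entry.versions.length - 1;
  registry.set(name, entry);
  return entry.versions.length; // the new version number
}

function getActivePrompt(name) {
  const entry = registry.get(name);
  return entry ? entry.versions[entry.activeIndex] : null;
}

function rollbackPrompt(name, version) {
  const entry = registry.get(name);
  if (!entry || version < 1 || version > entry.versions.length) return false;
  entry.activeIndex = version - 1; // repoint, never rewrite history
  return true;
}
```

Because old versions are never mutated, you can also log which prompt version produced each response, which makes A/B comparisons and incident debugging far easier.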
Semantic Search Endpoint Design
If your API exposes RAG-powered search, design the response to include both the generated answer and the source chunks that informed it. Clients need this for reference attribution, debugging, and user trust (showing 'based on document X').
Include chunk scores in the response. Expose a `minScore` query parameter so clients can filter low-confidence retrievals. Log every search query and its retrieved chunks — this data will drive your retrieval quality improvements.
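The response shaping described above can be sketched as a small pure function. The field names (`documentId`, `score`, `sources`) are illustrative, not a prescribed schema; in a route handler the threshold would come from the `minScore` query parameter.

```javascript
// Shape a RAG search response: filter retrieved chunks by a minimum
// similarity score and return both the answer and its source chunks.
function buildSearchResponse(answer, chunks, minScore = 0) {
  const sources = chunks
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score) // highest-confidence sources first
    .map(({ documentId, text, score }) => ({ documentId, text, score }));
  return { answer, sources };
}

// In an Express handler:
// const minScore = parseFloat(req.query.minScore ?? '0');
// res.json(buildSearchResponse(answer, retrievedChunks, minScore));
```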
Rate Limiting and Cost Controls
AI inference is expensive per request in a way that traditional CRUD operations are not. A single complex agent run can consume as much compute as thousands of database reads.
Implement token-level rate limiting, not just request-level. Track estimated token usage per user/tenant and expose usage data via your API so clients can build their own dashboards. Set hard limits that protect your margin and soft limits that trigger warnings before cutting off access.
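A per-tenant budget tracker with soft and hard limits might look like the following sketch. The class name and limits are illustrative; the token counts would come from the provider's usage field in the response, or from an estimator, and the `used` map would reset per billing window.

```javascript
// Hypothetical per-tenant token budget with soft and hard limits.
class TokenBudget {
  constructor({ softLimit, hardLimit }) {
    this.softLimit = softLimit; // triggers a warning
    this.hardLimit = hardLimit; // cuts off access
    this.used = new Map(); // tenantId -> tokens used this window
  }

  // Record usage and report whether the tenant may continue.
  record(tenantId, tokens) {
    const total = (this.used.get(tenantId) ?? 0) + tokens;
    this.used.set(tenantId, total);
    if (total >= this.hardLimit) return { allowed: false, warning: true };
    return { allowed: true, warning: total >= this.softLimit };
  }

  // Expose this via your API so clients can build usage dashboards.
  usage(tenantId) {
    return this.used.get(tenantId) ?? 0;
  }
}
```

Surfacing the `warning` flag in response headers (e.g. a usage-remaining header) lets well-behaved clients throttle themselves before they ever hit the hard limit.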