DevOps · Dec 2025 · 11 min read

Docker to Kubernetes: A Migration Story

How we migrated our infrastructure from Docker Compose to Kubernetes and the lessons we learned.

ATmega Team

We ran our production infrastructure on Docker Compose for two years. It served us well — simple, fast to iterate on, and easy for the whole team to understand. But as traffic grew and our service count increased, cracks began to show. Manual restarts, no self-healing, painful rolling deployments, and per-server SSH sessions became weekly frustrations.

This is the story of our migration to Kubernetes — what we planned, what surprised us, and what we would do differently.

Why We Waited (And Why We Were Right To)

Kubernetes has real operational overhead. If you have fewer than a handful of services and a small team, Docker Compose or a managed platform (Railway, Render, Fly.io) is probably the right call. We waited until the pain of not having Kubernetes was measurably costing us time and reliability.

The trigger for us was a combination of three things: we needed zero-downtime deployments with health-check gating, we wanted autoscaling for a batch-processing service with spiky load, and we needed better secret management across environments.

The Migration Approach

We chose a strangler fig approach — migrate service by service rather than cutting over all at once:

  • Week 1-2: Set up the Kubernetes cluster (GKE Autopilot) and establish cluster conventions — namespace structure, RBAC, resource requests/limits, and a shared Ingress pattern.
  • Week 3-4: Migrate stateless services first. These are the simplest — a Deployment, a Service, and an Ingress entry. We started with our image processing workers and API gateway.
  • Week 5-6: Tackle the stateful services. PostgreSQL moved to Cloud SQL (managed, not in-cluster). Redis stayed in-cluster using a StatefulSet with persistent volumes.
  • Week 7: DNS cutover. We ran old and new infrastructure in parallel for one week with traffic mirroring before flipping the canonical DNS entries.
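For the stateless services in weeks 3-4, each migration amounted to little more than a manifest pair like the following. This is a minimal sketch — the names, image, and port are placeholders, not our actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-worker
  namespace: services
spec:
  replicas: 3
  selector:
    matchLabels:
      app: image-worker
  template:
    metadata:
      labels:
        app: image-worker
    spec:
      containers:
        - name: image-worker
          image: gcr.io/our-project/image-worker:1.0.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: image-worker
  namespace: services
spec:
  selector:
    app: image-worker
  ports:
    - port: 80
      targetPort: 8080
```

An Ingress entry pointing at the Service completes the pattern; once the first service was templated, the rest were mostly copy-and-rename.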

What Surprised Us

Resource requests and limits are not optional. In Compose, our containers ran unconstrained. In Kubernetes, without resource requests the scheduler cannot make informed placement decisions, and request-less (BestEffort) pods are the first to be evicted under node memory pressure. Tuning these correctly took two weeks of observation with VPA recommendations.
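As a rough illustration, this is the shape of the resources block we ended up setting on every container. The values here are placeholders; ours came from VPA recommendations:

```yaml
resources:
  requests:
    cpu: 250m        # what the scheduler reserves on a node
    memory: 256Mi
  limits:
    memory: 512Mi    # container is OOM-killed if it exceeds this
    # no CPU limit: CPU is compressible, and throttling hurts tail latency
```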

Liveness vs readiness vs startup probes actually matter. We had a service that took 45 seconds to warm its in-memory cache. Without a startup probe, the liveness probe failed during warm-up and Kubernetes killed and restarted the container in a loop before it ever became healthy.
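The fix was a startup probe that tolerates the warm-up window before the liveness probe takes over. A sketch, assuming an HTTP health endpoint on port 8080 (both hypothetical):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 18   # up to 90s of grace before the container is restarted
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

The liveness probe is suspended until the startup probe succeeds once, so the 45-second cache warm fits comfortably inside the grace window.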

Secret management is worth investing in properly. We ended up using External Secrets Operator with GCP Secret Manager. The alternative — baking secrets into ConfigMaps or environment variables in YAML — is an audit nightmare.
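With External Secrets Operator, each secret is declared as an ExternalSecret that syncs a value from GCP Secret Manager into a native Kubernetes Secret. A sketch with hypothetical store and key names:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager   # ClusterSecretStore configured separately
    kind: ClusterSecretStore
  target:
    name: api-credentials      # the Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod-database-url # key name in GCP Secret Manager
```

Nothing sensitive lives in Git; the YAML only references keys, and rotation happens in Secret Manager.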

The Results

  • Zero-downtime deployments: rolling updates with readiness gates mean traffic never hits a pod that is not ready.
  • Autoscaling: our batch workers scale from 2 to 40 replicas in under 90 seconds during peak load using KEDA with a queue-depth metric.
  • Mean time to recovery dropped from ~8 minutes (manual intervention) to ~45 seconds (Kubernetes self-healing).
  • Deployment frequency increased because engineers no longer feared deployments.
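The batch-worker autoscaling above is driven by a KEDA ScaledObject keyed to queue depth. A sketch assuming a Pub/Sub-backed queue — names and thresholds here are illustrative, not our production values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-worker
spec:
  scaleTargetRef:
    name: batch-worker        # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 40
  triggers:
    - type: gcp-pubsub
      metadata:
        subscriptionName: batch-jobs
        value: "100"          # target undelivered messages per replica
```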

What We Would Do Differently

Invest in a proper Helm chart structure from day one. We started with raw YAML and later spent two weeks converting to Helm. The parameterisation and release management Helm gives you are worth the upfront effort.
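In practice that meant one shared chart templating the Deployment/Service/Ingress pattern, with each service contributing only a small values file. A hypothetical sketch of what a per-service values.yaml looks like under that structure:

```yaml
# values.yaml for a hypothetical api-gateway release of the shared chart
image:
  repository: gcr.io/our-project/api-gateway
  tag: "1.4.2"
replicaCount: 3
resources:
  requests:
    cpu: 250m
    memory: 256Mi
ingress:
  enabled: true
  host: api.example.com
```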

Set up a developer preview environment cluster earlier. Reducing the gap between local development and Kubernetes was the biggest productivity unlock for the engineering team.
