Duolingo's Kubernetes Leap: Migrating 500+ Services from ECS to EKS

2026-04-12 · 5 min read

Duolingo, the world's largest language-learning platform with over 128 million monthly active users, recently undertook one of the most ambitious infrastructure migrations in the edtech space: moving more than 500 backend services from AWS Elastic Container Service (ECS) to Amazon Elastic Kubernetes Service (EKS). This was not merely a technology swap — it was a fundamental shift in how the company deploys, operates, and scales its backend infrastructure.

This article breaks down the motivations behind the migration, the phased approach Duolingo took, the technical and organizational decisions that made it successful, and the lessons learned along the way.

Why Move from ECS to EKS?

ECS had served Duolingo well, but as the platform grew, several limitations became increasingly painful:

  • Ecosystem and Extensibility: Kubernetes has a vastly richer open-source ecosystem. Tools like Argo CD for GitOps, Karpenter for node autoscaling, and a wide range of operators and controllers provide capabilities that are difficult or impossible to replicate on ECS.
  • Deployment Flexibility: Kubernetes supports rolling updates natively through its Deployment resource, and canary and blue-green deployments follow standardized, well-documented patterns, typically built on add-on controllers such as Argo Rollouts. On ECS, achieving the same level of flexibility required significant custom tooling.
  • Ephemeral Environments: Spinning up short-lived, isolated environments for testing and development is straightforward on Kubernetes. This capability was a key enabler for faster iteration and higher-confidence deployments.
  • Industry Alignment: Kubernetes has become the de facto standard for container orchestration. Aligning with it simplifies hiring, onboarding, and knowledge transfer.

The Phased Migration Strategy

Migrating 500+ services at once would have been reckless. Duolingo adopted a deliberate, multi-phase approach:

Phase 1: Foundation (H2 2024)

The first phase focused on building the foundational platform. The core platform team (6–7 engineers) built the EKS infrastructure, integrated GitOps tooling (Argo CD), set up observability pipelines (Honeycomb for tracing, Sentry for error tracking, PagerDuty for alerting), and onboarded a small number of early-adopter services. This phase was about proving the platform and identifying issues before broad adoption.

Phase 2: Production Stabilization (H1 2025)

With the foundation in place, the team began moving real production traffic to EKS for key services. Weighted routing shifted traffic gradually from ECS to EKS, allowing the team to compare performance and reliability side by side. Observability tooling was essential here: engineers could tell whether an issue originated on the ECS or the EKS side of the traffic split.
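
To make the idea of a weighted split concrete, here is a minimal sketch in Python. The article does not describe Duolingo's routing layer, so everything here is an assumption for illustration: the `pick_backend` and `observed_share` names are hypothetical, and hashing a request ID into one of 100 buckets is just one common way to implement a percentage split.

```python
import hashlib


def pick_backend(request_id: str, eks_weight: int) -> str:
    """Return "eks" or "ecs" given the EKS share in percent (0-100).

    Hashing the request ID keeps the decision stable for a given ID.
    That stability is a design choice of this sketch, not something
    the article attributes to Duolingo.
    """
    if not 0 <= eks_weight <= 100:
        raise ValueError("eks_weight must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "eks" if bucket < eks_weight else "ecs"


def observed_share(eks_weight: int, samples: int = 10_000) -> float:
    """Fraction of synthetic request IDs that land on EKS."""
    hits = sum(
        pick_backend(f"req-{i}", eks_weight) == "eks" for i in range(samples)
    )
    return hits / samples
```

Because a request's bucket never changes, raising the weight from 10 to 25 only moves additional requests to EKS; nothing already on EKS flips back to ECS, which keeps side-by-side comparisons easier to read.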

Phase 3: Expansion and Automation (2026)

The current phase focuses on scaling up. The majority of services are being migrated, with increasing levels of automation to reduce per-service migration effort. Lessons from incidents in earlier phases have been incorporated into the tooling and runbooks.

gantt
    title Duolingo ECS → EKS Migration Timeline
    dateFormat YYYY-MM
    section Foundation
    Build EKS platform & tooling       :2024-07, 2024-12
    Early adopter services              :2024-10, 2024-12
    section Stabilization
    Production traffic on EKS           :2025-01, 2025-06
    Performance validation              :2025-03, 2025-06
    section Expansion
    Broad service migration             :2025-07, 2026-06
    Automation & scaling                :2026-01, 2026-06

Key Technical Decisions

GitOps with Argo CD

All infrastructure and application configurations are managed declaratively through Git. Argo CD continuously reconciles the desired state in Git with the actual state in the cluster. This provides an audit trail for every change, standardized approval workflows through pull requests, and reliable, one-click rollbacks.
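
The reconciliation idea can be sketched in a few lines of Python. This is a toy model, not Argo CD's implementation: the `diff_state` and `reconcile` functions are hypothetical, state is reduced to a plain dictionary of resource name to spec, and the sketch always deletes resources that are missing from Git, whereas Argo CD only prunes when pruning is enabled.

```python
def diff_state(desired: dict, actual: dict) -> dict:
    """Compute the actions needed to make `actual` match `desired`."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(
            k for k in desired if k in actual and desired[k] != actual[k]
        ),
        "delete": sorted(k for k in actual if k not in desired),
    }


def reconcile(desired: dict, actual: dict) -> dict:
    """Apply the diff and return the resulting cluster state."""
    plan = diff_state(desired, actual)
    result = dict(actual)  # leave the caller's dictionary untouched
    for name in plan["create"] + plan["update"]:
        result[name] = desired[name]
    for name in plan["delete"]:
        del result[name]
    return result
```

Under this model a rollback is simply a reconcile against an earlier `desired` dictionary, which mirrors why reverting a Git commit is enough to restore a previous state.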

Karpenter for Autoscaling

Karpenter replaced the traditional Kubernetes Cluster Autoscaler. It provisions nodes dynamically based on pending pod requirements, selecting the optimal instance type, size, and purchase option (including Spot instances). This resulted in better resource utilization and lower compute costs.
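
As a rough illustration of choosing a node for pending pods, the sketch below picks the cheapest instance type that satisfies a CPU and memory request. It is deliberately simplified and is not Karpenter's actual scheduling logic: the `InstanceType` class, the `cheapest_fit` function, and the catalog names and prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str          # hypothetical label, not a real EC2 type
    vcpu: float
    memory_gib: float
    hourly_usd: float  # made-up price for illustration


def cheapest_fit(
    cpu_needed: float,
    mem_needed_gib: float,
    catalog: Sequence[InstanceType],
) -> Optional[InstanceType]:
    """Return the cheapest instance type that fits, or None if none does."""
    candidates = [
        it
        for it in catalog
        if it.vcpu >= cpu_needed and it.memory_gib >= mem_needed_gib
    ]
    return min(candidates, key=lambda it: it.hourly_usd, default=None)


CATALOG = [
    InstanceType("small", 2, 4, 0.05),
    InstanceType("medium", 4, 16, 0.10),
    InstanceType("large", 8, 32, 0.20),
]
```

A real provisioner also weighs factors this sketch ignores, such as packing several pods onto one node, availability zones, and Spot interruption risk.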

Cellular Architecture for Isolation

Duolingo uses a "cellular architecture" where each environment (production, staging, development) runs in a separate, isolated EKS cluster. This limits the blast radius of any single failure — an issue in a development cluster cannot cascade into production. It also provides clear security boundaries between environments.

IPv6-Only Pods

Moving to IPv6-only networking for pods eliminated the IPv4 address exhaustion issues that large Kubernetes deployments frequently encounter. This was an important scalability decision for running hundreds of services with potentially thousands of pods.
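
The difference in scale is easy to check with Python's standard `ipaddress` module. The CIDR ranges below are illustrative assumptions (a private IPv4 /16 and the IPv6 documentation prefix), not Duolingo's actual network layout, and the counts ignore addresses that a cloud provider reserves.

```python
import ipaddress


def usable_addresses(cidr: str) -> int:
    """Total number of addresses in a CIDR block (reserved ones included)."""
    return ipaddress.ip_network(cidr).num_addresses


ipv4_total = usable_addresses("10.0.0.0/16")    # 65536
ipv6_total = usable_addresses("2001:db8::/64")  # 18446744073709551616
```

A single IPv6 /64 holds 2^64 addresses, so giving every pod its own address stops being a capacity-planning concern, while an IPv4 /16 shared by nodes and pods can run out once pod counts reach the tens of thousands.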

Organizational Change Management

Technology migration is only half the challenge. Duolingo recognized that the organizational and human dimensions were equally critical.

Empowering Product Teams

Rather than mandating a migration schedule, the platform team provided tooling, documentation, and hands-on support, allowing each of Duolingo's 30+ product teams to control their own migration timeline. This respected each team's priorities and reduced friction.

Building Developer Trust

Many engineers had limited Kubernetes experience. Duolingo invested heavily in internal training — including dedicated "immersion days" where engineers worked directly with the platform team to learn EKS concepts and operational practices. This built confidence and reduced fear of the unknown.

Clear Communication

The platform team maintained transparent communication throughout the migration. Regular updates on progress, known issues, and upcoming changes kept the organization aligned. When incidents occurred on the new platform, they were communicated openly, reinforcing a culture of psychological safety.

Challenges and Lessons Learned

  • AWS API Rate Limits: Migrating hundreds of services in parallel exposed AWS API throttling limits. The team had to implement careful pacing and retry logic in their automation to avoid hitting rate limits during peak migration periods.
  • Observability is Non-Negotiable: Without deep observability into both the old and new platforms, the phased traffic shifting would have been impossible. Being able to compare ECS and EKS performance side-by-side gave the team confidence to proceed.
  • Start Small, Move Fast Later: The multi-phase approach paid dividends. The foundation and stabilization phases caught issues that would have been catastrophic at scale. By the time the expansion phase began, the platform was mature and well-understood.
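
The pacing and retry logic mentioned in the first lesson is commonly implemented as exponential backoff with jitter. The sketch below shows that general technique under stated assumptions: `ThrottledError` is a hypothetical stand-in for whatever rate-limit error a cloud SDK raises, and `call_with_backoff` is an illustrative helper, not Duolingo's automation.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a rate-limit error from a cloud API."""


def call_with_backoff(
    fn,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Call `fn`, retrying on ThrottledError with capped, jittered backoff.

    `sleep` and `rng` are injectable so the behavior can be tested
    without real waiting or real randomness.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # "full jitter": wait a random time in [0, cap)
```

Randomizing the wait matters when many migration jobs run in parallel: without jitter, jobs that were throttled together retry together and hit the limit again at the same moment.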

Business Impact

The migration has delivered measurable business results:

  • Cost Optimization: Leveraging Graviton processors, Karpenter-managed Spot instances, and right-sized resource allocations resulted in double-digit percentage savings on EC2, RDS, and ElastiCache spend.
  • Deployment Speed: Standardized Kubernetes deployment patterns and GitOps workflows reduced deployment times and increased deployment frequency across teams.
  • Reliability: Cellular architecture and improved rollback capabilities have increased overall system resilience.

Conclusion

Duolingo's migration from ECS to EKS is a textbook example of how to execute a large-scale infrastructure modernization. The success was not just in the technology choices — Kubernetes, Argo CD, Karpenter — but in the deliberate phased approach, the investment in observability, and the deep commitment to organizational change management. For any organization contemplating a similar journey, Duolingo's experience demonstrates that technical excellence and human empathy must go hand in hand.

