Five Hard Lessons from Recovering a Catastrophic Microservices Migration

2025-11-21

At QCon San Francisco, Sonya Natanzon, VP of Engineering at HeartFlow, described how she led the recovery from a failed identity migration that she had inherited. The migration was meant to move a healthcare portal from a monolithic architecture to microservices using a commercial identity provider, but it failed immediately upon release and locked all users out. Natanzon stepped in with trust eroded and an urgent need to restore both system stability and the team's credibility. She distilled the experience into five lessons for organizations facing similar architectural crises.

1. Balance Forward Progress with Damage Control

Natanzon's first lesson was to balance immediate damage control with forward progress. For users, the priority was showing that the portal was available, which meant putting reliability ahead of new features. For business partners, the focus shifted to proving that architectural improvements could deliver concrete business value rather than technical perfection in isolation. A key strategic decision was to abandon "big bang" releases in favor of incremental delivery, which delivers business value sooner.

2. Own the Spotlight

In a significant shift from previous team behavior, Natanzon championed proactive and transparent communication. This meant openly sharing progress, setbacks, and realistic timelines with stakeholders. She stressed that transparency about challenges builds trust far more effectively than defensive posturing, allowing the team to rebuild confidence.

3. Make It Better for Now, Not for the Future

Contrary to the common impulse to build robust, future-proof systems after a failure, Natanzon advocated for a pragmatic approach: build for immediate needs. The team focused on delivering tangible improvements quickly and relentlessly pruned parts of the system that didn't provide concrete business value. This strategy allowed them to demonstrate rapid progress rather than getting bogged down in architectural perfectionism.

4. Perception Management Matters

Technical teams often dismiss concerns about perception, but Natanzon argued that perception directly affects a team's ability to execute. Negative perceptions can linger long after the underlying problems are resolved, and because they are often emotional, data alone is rarely enough to change them. Her recommendations included building strong relationships, consistently engaging stakeholders, and promptly addressing perceived problems.

5. Pay Attention to the Team

Natanzon emphasized that the team itself is a "patient" in architectural disaster recovery. She stabilized her team through improved documentation and better onboarding practices. Fundamentally, she transformed the culture from one of knowledge silos and individual achievement to one of collaboration, transparency, and collective success. Interestingly, the attrition that followed the initial failure inadvertently made it easier to implement these cultural changes.

Natanzon's experience echoes broader industry lessons about how difficult microservices migrations can be. Her recovery playbook offers a template for responding when an architectural initiative goes catastrophically wrong, and it underscores that technical solutions alone are insufficient without organizational trust and team cohesion.