Dancing with the Pods: Live migration of a database fleet while serving millions of queries
At ClickHouse, we recently changed the way we orchestrate the databases our customers provision, specifically the way we use StatefulSets. This required a painstaking refactoring of our operator. There was just one big problem: we wanted to migrate our legacy fleet of thousands of services from the old orchestration code path to the new one without any downtime, so that even running queries would continue uninterrupted. If there is one thing that people hate doing, it is migrations. They are painful, have lots of corner cases, and take a long time. In our case, it took us almost 6 months to migrate the entire fleet, and we encountered lots of interesting challenges along the way. This talk will walk you through the challenges of live migrating the orchestration of the entire ClickHouse Cloud fleet while continuing to serve customer queries and ingest. The story involves our operator, a deep dive into StatefulSets, a custom migration controller, semaphores, building a maintenance mode for the cloud product, durable execution workflows, and many, many database synchronization challenges.