Evolving AI Research Infrastructure with Kubernetes at Meta: Overcoming Challenges and Lessons Learned

Audience:

As the adoption of Kubernetes (k8s) continues to grow in the industry, migrating well-established research workloads to this container orchestration platform poses unique challenges. At Meta, our Cloud Infra team embarked on a journey to transition our Slurm-based AI research infrastructure to k8s while maintaining a seamless experience for researchers. In this talk, we will delve into the problems we encountered and the solutions we developed to overcome them.

One of the primary concerns was ensuring that researchers could continue working without needing to learn new tools or worrying about the underlying infrastructure. We achieved this by creating a custom interface that abstracts away the complexity of k8s interactions. Additionally, we had to address issues related to provisioning, authentication, and access control, which required us to develop novel solutions that integrate with the k8s ecosystem.

We also had to rethink how we manage resources such as login nodes, data volumes, and network traffic. By leveraging k8s features and a combination of open source and custom components, we were able to provide flexible and scalable infrastructure that meets the needs of our researchers.

However, one of the most significant challenges we faced was adapting to a world without traditional configuration management tools like Chef or Ansible. To manage our hosts, we had to rely on Helm and DaemonSets, which required us to rethink our approach to dynamic configuration changes. Patterns we had commonly implemented with systemd became s6 configurations running in containers. Despite these hurdles, the new approach led to some unexpected benefits, including increased consistency, reduced configuration drift, and a more streamlined infrastructure management process.

Join us as we share the story of our journey, highlighting the technical challenges we faced and the solutions we developed to overcome them. We'll dive into the details of our architecture, discuss the trade-offs we made, and explore the lessons we learned along the way. If you're considering migrating your own research workloads to k8s, or simply want to learn from our experiences, this talk is for you.

Time:
Friday, November 1, 2024 - 11:15