Introducing Amazon EKS Support in SageMaker HyperPod
We are thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability lets you add SageMaker HyperPod managed compute to EKS clusters and use automated node and job resiliency features for foundation model (FM) development.
Resilience Features for Large-Scale Model Training
FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. At that scale, hardware failures pose a significant challenge. Since its inception, SageMaker HyperPod has included managed resiliency features to mitigate such failures, enabling FM builders to scale their training and inference on Slurm clusters. Now, with EKS support in HyperPod, users can also benefit from these resiliency features on Kubernetes clusters.
Testimonials from Users
AI startups and enterprises alike have seen the benefits of Amazon EKS support in SageMaker HyperPod:
- Observea, a startup, has reduced operational costs by over 30% by using HyperPod.
- Articul8 AI, an enterprise, has integrated Amazon EKS support seamlessly into its training pipelines, making it easier to manage and operate its large-scale Kubernetes clusters.
Key Features of SageMaker HyperPod on EKS
Amazon EKS support in SageMaker HyperPod introduces significant resiliency features for large-scale model training on an EKS cluster. This post covers three topics:
- Overview of Amazon EKS support in SageMaker HyperPod
- HyperPod cluster setup and node resiliency features
- Training job resiliency with the job auto resume functionality
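The job auto resume idea in the last item can be illustrated with a minimal checkpoint-and-resume sketch. This is not HyperPod's implementation: the function names, the JSON checkpoint file, and the simulated failure are all hypothetical and only show the general pattern of restarting a job from its last saved step after a node is lost.

```python
import json
import os
import tempfile


class SimulatedNodeFailure(RuntimeError):
    """Stands in for a hardware fault that kills the training process."""


def load_checkpoint(path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["step"]
    return 0


def save_checkpoint(path, step):
    """Write the checkpoint atomically so a crash cannot leave a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, path)


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Run training steps from the last checkpoint; optionally fail at one step."""
    step = load_checkpoint(path)
    while step < total_steps:
        step += 1  # placeholder for one unit of real training work
        if step == fail_at:
            raise SimulatedNodeFailure(f"node lost at step {step}")
        if step % checkpoint_every == 0:
            save_checkpoint(path, step)
    save_checkpoint(path, step)
    return step


def run_with_auto_resume(path, total_steps, checkpoint_every, fail_at=None,
                         max_restarts=3):
    """Restart training after a failure, resuming from the last checkpoint."""
    restarts = 0
    while True:
        try:
            final_step = train(path, total_steps, checkpoint_every, fail_at)
            return final_step, restarts
        except SimulatedNodeFailure:
            restarts += 1
            if restarts > max_restarts:
                raise
            fail_at = None  # assumption: the replacement node is healthy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        print(run_with_auto_resume(ckpt, total_steps=10,
                                   checkpoint_every=3, fail_at=8))
```

In this example, a failure at step 8 loses only the work done since the checkpoint at step 6, which is why checkpoint frequency matters when jobs are restarted automatically.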
Architecture Overview
Amazon EKS support in HyperPod uses a 1-to-1 mapping between an EKS cluster and the HyperPod compute, which is intended to improve infrastructure stability and performance. Users can manage node groups, update configurations, and streamline dependencies through the HyperPod APIs.
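As a rough sketch of the 1-to-1 mapping described above, the helper below builds a request that ties one HyperPod cluster to exactly one EKS cluster ARN. The field names are modeled on the shape of the SageMaker CreateCluster API and should be checked against the current API reference. The helper name, cluster names, account ID, and ARN are hypothetical placeholders. A real request needs further fields, such as a lifecycle configuration and an execution role per instance group, which are omitted here.

```python
from typing import Any


def build_cluster_request(
    cluster_name: str,
    eks_cluster_arn: str,
    instance_groups: list[dict[str, Any]],
    node_recovery: str = "Automatic",
) -> dict[str, Any]:
    """Build a CreateCluster-style request mapping one HyperPod cluster
    to one EKS cluster (hypothetical helper, not part of any SDK)."""
    if not (eks_cluster_arn.startswith("arn:") and ":eks:" in eks_cluster_arn):
        raise ValueError("expected an EKS cluster ARN")
    if node_recovery not in ("Automatic", "None"):
        raise ValueError("node_recovery must be 'Automatic' or 'None'")

    names = [group["InstanceGroupName"] for group in instance_groups]
    if len(names) != len(set(names)):
        raise ValueError("instance group names must be unique")

    return {
        "ClusterName": cluster_name,
        # Exactly one EKS cluster serves as the orchestrator (1-to-1 mapping).
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "InstanceGroups": instance_groups,
        "NodeRecovery": node_recovery,
    }


# Placeholder values for illustration only.
request = build_cluster_request(
    cluster_name="demo-hyperpod",
    eks_cluster_arn="arn:aws:eks:us-west-2:111122223333:cluster/demo-eks",
    instance_groups=[
        {
            "InstanceGroupName": "workers",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 2,
        }
    ],
)
```

Building the request as a plain dictionary keeps the sketch free of network calls. Assuming the field names match, a dictionary like this could then be passed as keyword arguments to the SageMaker client's cluster creation call.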
Use Cases from Industry Leaders
Companies such as Thomson Reuters, Perplexity AI, and Hugging Face have used the resiliency features of SageMaker HyperPod on Kubernetes clusters to boost their FM training and inference capabilities.
With the recent addition of Amazon EKS support in SageMaker HyperPod, users have access to advanced resiliency features and seamless integration for large-scale model training. Whether you are a Kubernetes cluster administrator or an ML scientist, SageMaker HyperPod on EKS offers a robust infrastructure for managing training workloads effectively.