Creating a Cloud HPC solution for a FAANG company

Cloud HPC solutions, while not as performant as on-premises clusters due to cloud constraints, provide a flexible and cost-effective environment.

Offering

HPC Cluster on AWS

Technologies

AWS, Terraform, Slurm

Client Challenge

The AI/ML explosion changed the economics of research infrastructure overnight. A FAANG company needed to scale GPU capacity fast — more researchers, more compute, more experiments running in parallel — but on-premises HPC clusters couldn't keep up.

On-premises infrastructure has real advantages: customization, performance, security, and control. But it also carries serious disadvantages: massive upfront investment, long ROI cycles, years-long build timelines, and hardware that becomes obsolete before it's fully utilized. When AI research demands accelerate, waiting years to expand capacity is not an option.

Cloud HPC offered a different trade-off: more flexibility, faster provisioning, and cost efficiency for overflow capacity and cutting-edge hardware testing. The challenge was that cloud HPC offerings have limited native features and are notoriously difficult to adopt at enterprise scale. Integration with internal services was a hard requirement — not a nice-to-have.

Solution Delivered

Renaiss designed and deployed a production-ready HPC Slurm cluster on AWS using AWS ParallelCluster as the base layer, heavily customized to meet enterprise requirements.
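As an illustrative sketch only (instance types, subnet IDs, and sizes here are hypothetical, not the client's actual configuration), a minimal ParallelCluster v3 cluster definition with a Slurm scheduler, an EFA-enabled GPU queue, and an FSx for Lustre mount might look like:

```yaml
Region: us-east-1
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: c5.2xlarge
  Networking:
    SubnetId: subnet-xxxxxxxx
  Ssh:
    KeyName: hpc-admin-key
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      ComputeResources:
        - Name: p4d
          InstanceType: p4d.24xlarge
          MinCount: 0          # scale to zero when idle
          MaxCount: 16
          Efa:
            Enabled: true      # low-latency interconnect for multi-node training
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx
SharedStorage:
  - MountDir: /fsx
    Name: lustre
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200
```

A file like this is the "base layer" the case study refers to; the enterprise capabilities described below were built on top of it.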

The architecture went well beyond a standard ParallelCluster deployment. Key capabilities built on top of the base layer included:

  • Secure access for internal users, Unix user management, and two-factor authentication
  • S3 data pipelines and FSx for Lustre support across multiple configurations
  • Slurm partitions, limits, and accounting
  • Hardware observability and hardware testing frameworks
  • Login nodes and multi-tenant support across different AWS accounts
  • Persistent $HOME directories, Lustre eviction policies, and capacity planning tools to support research workflows at scale
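To make the Slurm partition and accounting capabilities concrete, here is a hedged sketch of how such settings are typically expressed in slurm.conf (the partition names, node ranges, and limits are hypothetical examples, not the client's configuration):

```ini
# slurm.conf excerpt: partitions with time and size limits
PartitionName=research Nodes=gpu-[001-016] Default=YES MaxTime=24:00:00 State=UP
PartitionName=interactive Nodes=gpu-[001-004] MaxTime=04:00:00 MaxNodes=1 State=UP

# Accounting via slurmdbd enables per-user and per-account usage tracking
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd-host
```

Partition limits of this kind, combined with slurmdbd accounting, are the standard Slurm mechanisms for enforcing fair sharing and tracking usage across many researchers and tenants.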

Custom safeguards were built specifically for AWS services to prevent runaway costs and enforce governance. Over time, an Azure cluster was added to the stack using CycleCloud, expanding the solution to a true multi-cloud environment.

Full tech stack: Terraform, Packer, AWS (EC2, EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito, DynamoDB, CloudWatch), PyTorch, NCCL, Duo.

Project Results

  • The platform scaled to support over 500 researchers across more than 20 clusters, spanning 5+ accounts and tenants. At peak, more than 6,000 GPUs and multiple petabytes of data on S3 and FSx were under active management.
  • The engagement had an impact beyond the client: AWS ParallelCluster incorporated several ideas developed during this project into its roadmap, a recognition of the technical depth and novelty of the work Renaiss contributed.
  • The result was a research infrastructure that could scale with AI demand — provisioning new clusters in hours instead of years, supporting hundreds of researchers simultaneously, and integrating seamlessly with internal enterprise systems.
