FAANG
November 2023

Creating a Cloud HPC solution for a FAANG company

Cloud HPC solutions, while less performant because of cloud constraints, provide a flexible and cost-effective environment.

Managed project

HPC Cluster on AWS

Technologies

AWS, Terraform, Slurm

Client Challenge

The AI/ML explosion required more researchers and more GPUs per researcher. On-premises HPC clusters have a few advantages (customization, performance, security, and control) but many disadvantages (massive upfront investment, long ROI, and build times measured in years, by which point fast-moving hardware is already obsolete).

Cloud HPC solutions, while less performant because of cloud constraints (hardware co-location, storage technologies, network constraints), provide a flexible and cost-effective environment, making them ideal for testing cutting-edge hardware and for overflow capacity.

These cloud offerings have limited features, which sometimes makes them hard to adopt.

Integration with internal services was a priority.

Solution Delivered

An HPC Slurm cluster deployed on AWS, using AWS ParallelCluster as the base layer and extended with many custom features to reach a production-ready environment:

  • Secure access for internal users
  • Unix user management
  • 2FA
  • S3 data pipelines
  • Support for multiple FSx for Lustre file systems
  • Slurm partitions and limits
  • Slurm accounting
  • Observability
  • Hardware testing
  • Login nodes
  • Support for multiple tenants on different accounts
  • Persistent $HOME
  • Lustre eviction
  • Capacity planning
  • Custom safeguards for AWS services

Over time, an Azure cluster was also added to the stack using CycleCloud.

Tech stack: Terraform, Packer, AWS (EC2 + EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito, DynamoDB, CloudWatch), PyTorch + NCCL, Duo.
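To make one of the listed features concrete, here is a minimal Python sketch of how a Lustre eviction policy might choose which files to release from a file system that is running out of space. It is an illustration under stated assumptions, not the implementation used in this engagement: the `FileInfo` type, the `select_for_eviction` function, and the 80% target are all hypothetical, and the sketch only selects candidates. Actually releasing them (for example with Lustre's `lfs hsm_release` on files already exported to S3) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical record describing one file on the Lustre file system."""
    path: str
    size_bytes: int
    last_access: float  # Unix timestamp of the last access


def select_for_eviction(files, used_bytes, capacity_bytes, target_ratio=0.8):
    """Return least-recently-accessed files whose release would bring
    usage down to at most target_ratio of capacity.

    Assumption: every file passed in is safe to release, i.e. a copy
    already exists in the linked S3 bucket.
    """
    target_bytes = int(capacity_bytes * target_ratio)
    to_free = used_bytes - target_bytes
    selected = []
    if to_free <= 0:
        return selected  # already at or below the target, nothing to do

    freed = 0
    # Oldest access time first, so recently used data stays on fast storage.
    for info in sorted(files, key=lambda f: f.last_access):
        if freed >= to_free:
            break
        selected.append(info)
        freed += info.size_bytes
    return selected
```

A caller would build the `FileInfo` list from a file system scan, pass in current usage and capacity, and hand the returned paths to whatever release mechanism the cluster uses. Sorting by last access is only one possible policy; size-weighted or per-tenant policies would fit the same shape.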

Project Results

  • 500+ researchers
  • 20+ clusters
  • 5+ accounts/tenants
  • 6000+ GPUs under management
  • multiple PB on S3/FSx
  • The AWS ParallelCluster project took many ideas from this engagement

Renaissance Software LLC