Cloud HPC solutions, while less performant than on-premises clusters due to cloud constraints, provide a flexible and cost-effective environment.
HPC Cluster on AWS
AWS, Terraform, Slurm
The AI/ML explosion changed the economics of research infrastructure overnight. A FAANG company needed to scale GPU capacity fast — more researchers, more compute, more experiments running in parallel — but on-premises HPC clusters couldn't keep up.
On-premises infrastructure has real advantages: customization, performance, security, and control. But it also carries serious disadvantages: massive upfront investment, long ROI cycles, years-long build timelines, and hardware that becomes obsolete before it's fully utilized. When AI research demands accelerate, waiting years to expand capacity is not an option.
Cloud HPC offered a different trade-off: more flexibility, faster provisioning, and cost efficiency for overflow capacity and cutting-edge hardware testing. The challenge was that cloud HPC offerings have limited native features and are notoriously difficult to adopt at enterprise scale. Integration with internal services was a hard requirement — not a nice-to-have.
Renaiss designed and deployed a production-ready HPC Slurm cluster on AWS using AWS ParallelCluster as the base layer, heavily customized to meet enterprise requirements.
The architecture went well beyond a standard ParallelCluster deployment. Key capabilities built on top of the base layer included secure access for internal users, Unix user management, two-factor authentication, S3 data pipelines, FSx for Lustre support across multiple configurations, Slurm partitions and limits, Slurm accounting, hardware observability, hardware testing frameworks, login nodes, and multi-tenant support across different AWS accounts. Persistent $HOME directories, Lustre eviction policies, and capacity planning tools were also implemented to support research workflows at scale.
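As a rough illustration of what a Lustre eviction policy can look like, the sketch below picks the least recently accessed files until usage drops to a target share of capacity. It is not the policy deployed in this project: the `FileRecord` type, the `select_for_eviction` function, and the 80% default target are all hypothetical, and a real policy would also need to confirm that each file is safely stored in S3 before releasing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical view of one file on the Lustre filesystem."""
    path: str
    size_bytes: int
    last_access: float  # Unix timestamp of the last read


def select_for_eviction(files, used_bytes, capacity_bytes, target_ratio=0.8):
    """Return paths to evict, least recently accessed first, until usage
    falls to target_ratio of capacity (assumed policy, for illustration)."""
    target = capacity_bytes * target_ratio
    evicted = []
    for record in sorted(files, key=lambda r: r.last_access):
        if used_bytes <= target:
            break
        evicted.append(record.path)
        used_bytes -= record.size_bytes
    return evicted
```

Calling `select_for_eviction(files, used_bytes, capacity_bytes)` returns an empty list when the filesystem is already at or below the target, so it can run on a schedule without side effects.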
Custom safeguards were built specifically for AWS services to prevent runaway costs and enforce governance. Over time, an Azure cluster was added to the stack using Azure CycleCloud, expanding the solution to a true multi-cloud environment.
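One simple form a cost safeguard can take is a per-account spend check. The sketch below sums projected hourly spend for each AWS account and reports the accounts above their budget. It is an assumption-laden illustration, not the safeguard built for this project: the `Allocation` type, the `accounts_over_budget` function, the account names, and the prices are all hypothetical placeholders supplied by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """Hypothetical record of nodes currently running for one account."""
    account: str
    instance_type: str
    node_count: int
    hourly_price_usd: float  # per-node price, supplied by the caller


def accounts_over_budget(allocations, hourly_budget_usd):
    """Return {account: projected hourly spend} for accounts above budget.

    Accounts missing from hourly_budget_usd are treated as having a
    budget of zero, so unknown tenants are always flagged.
    """
    spend = {}
    for alloc in allocations:
        cost = alloc.node_count * alloc.hourly_price_usd
        spend[alloc.account] = spend.get(alloc.account, 0.0) + cost
    return {
        account: total
        for account, total in spend.items()
        if total > hourly_budget_usd.get(account, 0.0)
    }
```

Treating unknown accounts as zero-budget is a deliberately conservative choice for a multi-tenant setup: a tenant has to be registered with a budget before it can run anything unflagged.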
Full tech stack: Terraform, Packer, AWS (EC2, EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito, DynamoDB, CloudWatch), PyTorch, NCCL, DUO.
