About the Role
We are building an orchestration platform that enables self-service access to GPU-powered environments across cloud, on-prem, and hybrid infrastructure. You will design and build the core systems across compute, networking, and storage.
What You’ll Do
- Build and operate multi-cluster environments using Kubernetes
- Design infrastructure across cloud, on-prem, and virtualized environments
- Develop systems for cluster provisioning, workload isolation, and scaling
- Work across the stack: servers → networking → storage → containers → software
- Optimize performance for data-intensive and GPU workloads
Requirements
- Hands-on expertise with Kubernetes
- Experience with network configuration and data center design (CCNP)
- Solid understanding of Linux, networking, and distributed systems
- Experience with Infrastructure as Code (e.g., Terraform)
- Experience with at least one major cloud platform (AWS, GCP, or Azure)
- Good understanding of storage systems (distributed storage, NFS, object storage)
- Familiarity with servers and hardware
Nice to Have
- Experience with GPU environments (e.g., NVIDIA ecosystem)
- Knowledge of high-performance networking (e.g., InfiniBand)
- Familiarity with tools like Argo Workflows or Apache Airflow
- Experience with distributed storage (e.g., Ceph)
Soft Skills & Mindset
- Strong problem-solving ability and curiosity; comfortable working on ambiguous, unsolved challenges
- R&D mindset: you enjoy experimenting, prototyping, and iterating on new ideas
- Ability to think beyond tools and understand systems as a whole
- Ownership mentality: you take responsibility for reliability and outcomes, not just tasks
- Effective communication and collaboration across engineering teams
- Resilience under pressure, especially during incidents or system failures