Hiring: Program Manager
Job Description
Required Profile: Technical Program Manager (AI Infrastructure Platform)
We are looking for a Technical Program Manager to lead the execution of a complex AI and GPU infrastructure platform. This role requires strong technical understanding and the ability to coordinate multiple engineering teams.
Key Areas of Responsibility:
- Manage execution of a Kubernetes-based platform with multi-cluster architecture (cluster-of-clusters), focusing on scalability and system upgrades
- Coordinate multi-cloud infrastructure development (AWS in MVP, expanding to other cloud providers) across full resource lifecycle (provisioning, scaling, teardown)
- Oversee GPU infrastructure (NVIDIA / AMD) with understanding of topology-aware scheduling and performance optimization
- Drive scheduling systems for workloads (SLURM in MVP, future evolution with Volcano and Kueue)
- Ensure observability across the platform using OpenTelemetry, Prometheus, and Grafana
- Manage IAM and security integration (RBAC, SSO, LDAP) in collaboration with security teams
- Coordinate development of LLM-based conversational interfaces and API-driven, event-based systems
- Oversee infrastructure pipelines, container registries, and GPU/CPU health and benchmarking systems
- Define and track system performance metrics (benchmarks, telemetry, health checks)
Required Skills:
- Proven experience managing complex technical programs (Cloud / AI / Distributed Systems)
- Strong understanding of Kubernetes, DevOps, and scalable system architecture
- Familiarity with GPU computing or HPC environments is a strong advantage
- Ability to coordinate multiple engineering teams and manage cross-layer dependencies
- Strong planning, documentation, and risk management skills
Employment type: Full-time (Remote possible)
Required skills
- Project management
- Kubernetes
- DevOps
Minimum work experience
- No preference
Gender
- No preference
Military service status
- No preference