Infrastructure & Platform Engineer
Engineers in this role architect and operate the systems that power AI research and product development at scale. They design distributed infrastructure for training, serving, and orchestrating AI workloads across GPU clusters, build internal platforms that accelerate developer velocity, and optimize the critical path from code to production. The role pairs deep systems engineering expertise in areas like Kubernetes, build systems, data pipelines, and performance tuning with the unique demands of AI workloads. It combines hands-on infrastructure work with close collaboration with researchers and product teams to eliminate the bottlenecks that slow innovation.
Skills
What companies are looking for in this role.
Designing and deploying cloud-based machine learning training and inference clusters at scale
Designing and operating Kubernetes clusters, including schedulers, control planes, and custom controllers for specialized workloads
Implementing Infrastructure as Code for reproducible resource provisioning and configuration management
Building and maintaining CI/CD pipelines for machine learning workflows and distributed systems
Optimizing system performance at scale, including GPU utilization, latency, and throughput
Diagnosing and resolving distributed systems issues, including performance bottlenecks and hardware failures
Managing and optimizing network-based distributed file systems and blob storage solutions for machine learning workloads
Designing and building tools for monitoring, observability, and operational visibility across infrastructure
Provisioning bare metal servers and managing hardware lifecycle across data centers and edge environments
Developing custom autoscaling solutions for machine learning and compute-intensive workloads
Implementing security best practices across infrastructure stacks without impeding research velocity
Building abstractions and developer-friendly tools that accelerate research iteration and reduce infrastructure friction
Architecting multi-region and multi-cloud infrastructure for distributed training and inference
Designing systems for measuring and evaluating large-scale machine learning workloads to determine production readiness
Integrating artificial intelligence capabilities into developer workflows and productivity tools
Collaborating with research and product teams to translate workload requirements into infrastructure solutions
Owning technical strategy, roadmaps, and long-term architectural decisions for infrastructure systems
Taking ownership of production systems and participating in incident diagnosis and resolution
Communicating complex technical concepts across teams with different expertise and priorities
Mentoring engineers and establishing best practices for building and operating large-scale systems
Open Jobs
531 open Infrastructure & Platform Engineer jobs across 81 companies.
Other Engineering roles
General-purpose software engineering roles focused on building and maintaining software systems. Covers generalist SWE positions that don't clearly fall into frontend, backend, fullstack, or other specialized tracks.
Engineers focused on server-side systems, APIs, services, and data processing pipelines. Includes roles explicitly labeled as backend or server-side development.
Engineers specializing in user-facing interfaces, web applications, and client-side development. Includes UI/UX engineering and web development roles.
Engineers working across the entire application stack, handling both frontend and backend responsibilities.
Engineers embedded with customers or deployed on-site to solve domain-specific technical problems. Combines engineering skills with direct client interaction.