Site Reliability Engineer
Engineers in this role maintain the reliability and performance of AI infrastructure at scale. Their days are spent on incident response, automation, and observability across the distributed systems that power AI workloads. They differ from software engineers by focusing on operational excellence and system resilience rather than feature development, and from DevOps roles by owning broader platform-level reliability goals. These teams typically sit within infrastructure or platform organizations and partner closely with product engineering teams to keep AI services fast, secure, and highly available across multiple regions.
Skills
What companies are looking for in this role.
Designing and operating infrastructure across multiple cloud providers using infrastructure-as-code principles
Building and maintaining comprehensive monitoring, logging, and alerting systems for production infrastructure
Managing and scaling Kubernetes clusters, including lifecycle management, upgrades, networking, and resource orchestration
Leading incident response processes, conducting root cause analysis, and driving improvements from postmortems
Defining, implementing, and evolving Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
Automating operational tasks and reducing toil through infrastructure automation and tooling development
Participating in on-call rotations and maintaining production system availability during operational emergencies
Designing and implementing CI/CD pipelines and deployment infrastructure for reliable application delivery
Operating containerized workloads and managing container runtimes in production environments
Implementing GitOps workflows and managing infrastructure through declarative configuration
Implementing security best practices, including identity and access management, least-privilege principles, and compliance standards
Operating distributed databases and data systems at scale, including configuration, performance tuning, and capacity planning
Managing disaster recovery strategies and backup procedures, and meeting recovery time objectives
Designing network infrastructure including load balancing, service mesh, and DNS management at scale
Conducting performance analysis and reliability testing of large-scale distributed systems
Optimizing infrastructure costs while maintaining reliability and performance standards
Ensuring data pipeline reliability and managing high-throughput ingestion systems
Designing and managing multi-tenant isolation strategies for shared infrastructure platforms
Analyzing measurement data and system telemetry to support engineering decision-making
Managing GPU and accelerated computing infrastructure for high-performance workloads
Implementing observability for AI and machine learning workloads, including tracking training job reliability
Collaborating with software engineering teams to embed reliability principles into system design and deployment processes
Writing and maintaining runbooks, operational procedures, and documentation for production systems
Setting operational standards and quality expectations across engineering organizations
Evaluating, negotiating, and managing vendor relationships for third-party services and migrations
Technology
The tools and technologies that define this role.
Open Jobs
105 open Site Reliability Engineer jobs across 42 companies.
Other Engineering roles
General-purpose software engineering roles focused on building and maintaining software systems. Covers generalist SWE positions that don't clearly fall into frontend, backend, fullstack, or other specialized tracks.
Engineers focused on server-side systems, APIs, services, and data processing pipelines. Includes roles explicitly labeled as backend or server-side development.
Engineers specializing in user-facing interfaces, web applications, and client-side development. Includes UI/UX engineering and web development roles.
Engineers working across the entire application stack, handling both frontend and backend responsibilities.
Engineers building and maintaining internal platforms, cloud infrastructure, compute systems, and developer tooling.