Site Reliability Engineer
Engineers in this role maintain the reliability and performance of AI infrastructure at scale, spending their days on incident response, automation, and observability across the distributed systems that power AI workloads. They differ from software engineers in focusing on operational excellence and system resilience rather than feature development, and from DevOps roles in owning broader platform-level reliability goals. These teams typically sit within infrastructure or platform organizations and partner closely with product engineering teams to keep AI services fast, secure, and highly available across multiple regions.
Skills
What companies are looking for in this role.
Designing and implementing monitoring, alerting, and observability systems across distributed infrastructure
Managing incident response processes including root cause analysis and postmortem facilitation
Automating operational tasks and building infrastructure-as-code deployment solutions
Defining, implementing, and tracking Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
Troubleshooting and debugging production issues across complex system stacks
Operating and maintaining stateful storage and database systems at scale
Managing multi-cloud and multi-region infrastructure deployment and operations
Building CI/CD pipelines and managing deployment processes for reliable releases
Reducing Mean Time To Recovery (MTTR) through tooling, runbooks, and automation
Optimizing system performance, architecture, and scaling for maximum uptime and minimal latency
Participating in on-call rotations and maintaining incident escalation paths
Capacity planning and resource optimization for infrastructure scaling
Understanding Linux operating system internals, networking concepts, and system-level optimization
Designing self-healing and resilient systems that respond automatically to failure scenarios
Leading production readiness reviews and reliability standards enforcement across teams
Applying AI and machine learning techniques to improve incident detection and operational efficiency
Building predictive maintenance and anomaly detection systems for infrastructure health
Maintaining observability for machine learning model-serving workloads and inference infrastructure
Developing agentic tooling and AI-driven automation for operational workflows
Conducting thorough, blameless postmortems and driving preventative improvements
Collaborating with product and engineering teams on architectural reliability improvements
Building developer tooling and improving developer productivity through infrastructure improvements
Mentoring engineering teams and establishing reliability as a core organizational value
Communicating complex technical concepts and driving technical decision-making across stakeholders
Balancing long-term infrastructure strategy with immediate engineering needs
Open Jobs
111 open Site Reliability Engineer jobs across 43 companies.