Applied Methods

Site Reliability Engineer

Engineers in this role maintain the reliability and performance of AI infrastructure at scale. Their days go to incident response, automation, and observability across the distributed systems that power AI workloads. They differ from software engineers in focusing on operational excellence and system resilience rather than feature development, and from DevOps roles in owning broader, platform-level reliability goals. These teams typically sit within infrastructure or platform organizations and partner closely with product engineering teams to keep AI services fast, secure, and highly available across multiple regions.

$ titles --canonical
Site Reliability Engineer, Senior SRE, Staff SRE, Production Engineer, Reliability Engineer, Infrastructure SRE
Open Jobs: 111
Companies Hiring: 43
$02

Skills

What companies are looking for in this role.

$ skills --core

Designing and implementing monitoring, alerting, and observability systems across distributed infrastructure: 95%
Managing incident response processes including root cause analysis and postmortem facilitation: 95%
Automating operational tasks and building infrastructure-as-code deployment solutions: 95%
Defining, implementing, and tracking Service Level Objectives (SLOs) and Service Level Indicators (SLIs): 90%
Troubleshooting and debugging production issues across complex system stacks: 85%
Operating and maintaining stateful storage and database systems at scale: 85%
Managing multi-cloud and multi-region infrastructure deployment and operations: 85%
Building CI/CD pipelines and managing deployment processes for reliable releases: 85%
Reducing Mean Time To Recovery (MTTR) through tooling, runbooks, and automation: 80%
Optimizing system performance, architecture, and scaling for maximum uptime and minimal latency: 80%
Participating in on-call rotations and maintaining incident escalation paths: 80%
Capacity planning and resource optimization for infrastructure scaling: 80%
Understanding Linux operating system internals, networking concepts, and system-level optimization: 75%
Designing self-healing and resilient systems that respond automatically to failure scenarios: 75%
Leading production readiness reviews and reliability standards enforcement across teams: 75%
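
For readers less familiar with the SLO/SLI and MTTR skills named in the core list, the arithmetic behind them can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the request counts, and the 99.9% target are hypothetical and are not drawn from any job posting on this page.

```python
from datetime import datetime, timedelta


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure rate the SLO permits
    actual_failure = 1.0 - sli          # failure rate actually observed
    return 1.0 - actual_failure / allowed_failure


def mean_time_to_recovery(incidents):
    """MTTR: mean of (resolved - detected) over (detected, resolved) pairs."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 failures, 99.9% SLO.
    sli = availability_sli(999_500, 1_000_000)
    print(f"SLI: {sli:.4%}")
    print(f"Error budget remaining: {error_budget_remaining(sli, 0.999):.0%}")

    # Two hypothetical incidents lasting 30 and 90 minutes.
    start = datetime(2024, 1, 1, 12, 0)
    incidents = [
        (start, start + timedelta(minutes=30)),
        (start, start + timedelta(minutes=90)),
    ]
    print(f"MTTR: {mean_time_to_recovery(incidents)}")
```

In practice these figures usually come from a monitoring system rather than hand-entered counts; the sketch only shows how the quantities relate to one another.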
$ skills --emerging

Applying AI and machine learning techniques to improve incident detection and operational efficiency: 70%
Building predictive maintenance and anomaly detection systems for infrastructure health: 65%
Maintaining observability for machine learning model-serving workloads and inference infrastructure: 60%
Developing agentic tooling and AI-driven automation for operational workflows: 60%
$ skills --soft

Conducting thorough, blameless postmortems and driving preventative improvements: 80%
Collaborating with product and engineering teams on architectural reliability improvements: 80%
Building developer tooling and empowering developer productivity through infrastructure improvements: 70%
Mentoring engineering teams and establishing reliability as a core organizational value: 70%
Communicating complex technical concepts and driving technical decision-making across stakeholders: 70%
Balancing long-term infrastructure strategic goals with immediate engineering needs: 65%
$03

Technology

The tools and technologies that define this role.

$ tech --language
Go: high
Python: high
Java: moderate
$ tech --framework
CUDA: low
$ tech --platform
AWS: very high
Kubernetes: very high
Linux: very high
Azure: high
Docker: high
Google Cloud Platform (GCP): high
NVIDIA GPU: moderate
ELK Stack: low
MongoDB Atlas: low
$ tech --tool
Terraform: very high
Datadog: high
Ansible: moderate
ArgoCD: moderate
GitHub Actions: moderate
Grafana: moderate
Helm: moderate
Jenkins: moderate
PagerDuty: moderate
Prometheus: moderate
Pulumi: moderate
Argo Workflows: low
Coralogix: low
Crossplane: low
Sentry: low
Wiz: low
$ tech --concept
Distributed systems: high
DNS: moderate
GitOps: moderate
Machine Learning infrastructure: moderate
TCP/IP: moderate
TLS/SSL: moderate
$04

Open Jobs

111 open Site Reliability Engineer jobs across 43 companies.

Crusoe · 1w
Production Engineer (Kubernetes)
Dublin - IE · Engineering
Gong · 1w
Senior SRE Engineer
Tel Aviv · Engineering
Databricks · 1w
Sr Technical Solutions Engineering
McLean, Virginia · Engineering
Databricks · 1w
Staff Technical Solution Engineering
McLean, Virginia · Engineering
Nscale · 1w
Principal Site Reliability Engineer - AI Infrastructure Operations
Houston; New York; San Francisco; Seattle · Engineering
Gong · 1w
Senior DevOps Engineer
Tel Aviv · Engineering
Nebius · 2w
Senior Site Reliability Engineer (DevTools)
Amsterdam, Netherlands; Germany; Israel; London, United Kingdom; Remote - Europe; United Kingdom · Engineering
Crusoe · 2w
Staff Network Engineer, Operations
San Francisco, CA - US · Engineering
Lambda · 2w
Senior Incident Manager
Remote, USA · Engineering
Vapi · 2w
Member of Technical Staff, Site Reliability Engineer
San Francisco · Engineering
Crusoe · 3w
Senior Staff Network Engineer, Operations
San Francisco, CA - US · Engineering
Nscale · 3w
Site Reliability Engineer
US · Engineering
Crusoe · 3w
Senior Production Engineer
San Francisco, CA - US · Engineering
Nectar Social · 3w
Senior Site Reliability Engineer
Palo Alto · Engineering
Mistral AI · 3w
Applied AI Engineer, Site Reliability Engineer - EMEA
Paris · Engineering
Block · 1mo
Senior Site Reliability Engineer
Melbourne, Australia · Engineering
Waymo · 1mo
Ridehailing, Site Reliability Engineer
Warsaw, Masovian Voivodeship, Poland · Engineering
RunPod · 1mo
Site Reliability Engineer
Remote, USA · Engineering
Cresta · 1mo
Infrastructure Engineer/SRE
Taiwan (Remote) · Engineering
Cresta · 1mo
Senior Infrastructure Engineer/SRE
United States (Remote) · Engineering