
Site Reliability Engineer

Engineering

Engineers in this role maintain the reliability and performance of AI infrastructure at scale, spending their days on incident response, automation, and observability across the distributed systems that power AI workloads. They differ from software engineers by focusing on operational excellence and system resilience rather than feature development, and from DevOps roles by owning broader platform-level reliability goals. These teams typically sit within infrastructure or platform organizations, partnering closely with product engineering teams to keep AI services fast, secure, and highly available across multiple regions.

$ titles --canonical

Site Reliability Engineer

Senior SRE

Staff SRE

Production Engineer

Reliability Engineer

Infrastructure SRE

Open Jobs: 111

Companies Hiring: 43

$02_

Skills

What companies are looking for in this role.

$ skills --core

Designing and implementing monitoring, alerting, and observability systems across distributed infrastructure

95%

Managing incident response processes including root cause analysis and postmortem facilitation

95%

Automating operational tasks and building infrastructure-as-code deployment solutions

95%

Defining, implementing, and tracking Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

90%

Troubleshooting and debugging production issues across complex system stacks

85%

Operating and maintaining stateful storage and database systems at scale

85%

Managing multi-cloud and multi-region infrastructure deployment and operations

85%

Building CI/CD pipelines and managing deployment processes for reliable releases

85%

Reducing Mean Time To Recovery (MTTR) through tooling, runbooks, and automation

80%

Optimizing system performance, architecture, and scaling for maximum uptime and minimal latency

80%

Participating in on-call rotations and maintaining incident escalation paths

80%

Capacity planning and resource optimization for infrastructure scaling

80%

Understanding Linux operating system internals, networking concepts, and system-level optimization

75%

Designing self-healing and resilient systems that respond automatically to failure scenarios

75%

Leading production readiness reviews and reliability standards enforcement across teams

75%

$ skills --emerging

Applying AI and machine learning techniques to improve incident detection and operational efficiency

70%

Building predictive maintenance and anomaly detection systems for infrastructure health

65%

Maintaining observability for machine learning model-serving workloads and inference infrastructure

60%

Developing agentic tooling and AI-driven automation for operational workflows

60%

$ skills --soft

Conducting thorough, blameless postmortems and driving preventative improvements

80%

Collaborating with product and engineering teams on architectural reliability improvements

80%

Building developer tooling and empowering developer productivity through infrastructure improvements

70%

Mentoring engineering teams and establishing reliability as a core organizational value

70%

Communicating complex technical concepts and driving technical decision-making across stakeholders

70%

Balancing long-term infrastructure strategic goals with immediate engineering needs

65%

$03_

Technology

The tools and technologies that define this role.

$ tech --language

Go: high

Python: high

Java: moderate

$ tech --framework

CUDA: low

$ tech --platform

AWS: very high

Kubernetes: very high

Linux: very high

Azure: high

Docker: high

Google Cloud Platform (GCP): high

NVIDIA GPU: moderate

ELK Stack: low

MongoDB Atlas: low

$ tech --tool

Terraform: very high

Datadog: high

Ansible: moderate

ArgoCD: moderate

GitHub Actions: moderate

Grafana: moderate

Helm: moderate

Jenkins: moderate

PagerDuty: moderate

Prometheus: moderate

Pulumi: moderate

Argo Workflows: low

Coralogix: low

Crossplane: low

Sentry: low

Wiz: low

$ tech --concept

Distributed systems: high

DNS: moderate

GitOps: moderate

Machine Learning infrastructure: moderate

TCP/IP: moderate

TLS/SSL: moderate

$04_

Open Jobs

111 open Site Reliability Engineer jobs across 43 companies.

Crusoe · 1w
Production Engineer (Kubernetes)
Dublin - IE · Engineering

Databricks · 1w
Sr Technical Solutions Engineering
McLean, Virginia · Engineering

Databricks · 1w
Staff Technical Solution Engineering
McLean, Virginia · Engineering

Nscale · 1w
Principal Site Reliability Engineer - AI Infrastructure Operations
Houston; New York; San Francisco; Seattle · Engineering

Gong · 1w
Senior DevOps Engineer
Tel Aviv · Engineering

Nebius · 2w
Senior Site Reliability Engineer (DevTools)
Amsterdam, Netherlands; Germany; Israel; London, United Kingdom; Remote - Europe; United Kingdom · Engineering

Crusoe · 2w
Staff Network Engineer, Operations
San Francisco, CA - US · Engineering

Lambda · 2w
Senior Incident Manager
Remote, USA · Engineering

Vapi · 2w
Member of Technical Staff, Site Reliability Engineer
San Francisco · Engineering

Crusoe · 3w
Senior Staff Network Engineer, Operations
San Francisco, CA - US · Engineering

Nscale · 3w
Site Reliability Engineer
US · Engineering

Crusoe · 3w
Senior Production Engineer
San Francisco, CA - US · Engineering

Nectar Social · 3w
Senior Site Reliability Engineer
Palo Alto · Engineering

Mistral AI · 3w
Applied AI Engineer, Site Reliability Engineer - EMEA
Paris · Engineering

Block · 1mo
Senior Site Reliability Engineer
Melbourne, Australia · Engineering

Waymo · 1mo
Ridehailing, Site Reliability Engineer
Warsaw, Masovian Voivodeship, Poland · Engineering

RunPod · 1mo
Site Reliability Engineer
Remote, USA · Engineering

Cresta · 1mo
Infrastructure Engineer/SRE
Taiwan (Remote) · Engineering

Cresta · 1mo
Senior Infrastructure Engineer/SRE
United States (Remote) · Engineering


$ roles --related --function=engineering

Other Engineering roles

Software Engineer

General-purpose software engineering roles focused on building and maintaining software systems. Covers generalist SWE positions that don't clearly fall into frontend, backend, fullstack, or other specialized tracks.

Backend Engineer

Engineers focused on server-side systems, APIs, services, and data processing pipelines. Includes roles explicitly labeled as backend or server-side development.

Frontend Engineer

Engineers specializing in user-facing interfaces, web applications, and client-side development. Includes UI/UX engineering and web development roles.

Fullstack Engineer

Engineers working across the entire application stack, handling both frontend and backend responsibilities.

Infrastructure & Platform Engineer

Engineers building and maintaining internal platforms, cloud infrastructure, compute systems, and developer tooling.