Applied Methods

Site Reliability Engineer

Engineers in this role maintain the reliability and performance of AI infrastructure at scale, spending their days on incident response, automation, and observability across the distributed systems that power AI workloads. They differ from software engineers in focusing on operational excellence and system resilience rather than feature development, and from DevOps roles in owning broader platform-level reliability goals. SRE teams typically sit within infrastructure or platform organizations and partner closely with product engineering teams to keep AI services fast, secure, and highly available across multiple regions.

$ titles --canonical
Site Reliability Engineer
Senior SRE
Staff SRE
Production Engineer
Reliability Engineer
Infrastructure SRE
Open Jobs: 105
Companies Hiring: 42
$02

Skills

What companies are looking for in this role.

$ skills --core

95%  Designing and operating infrastructure across multiple cloud providers with infrastructure-as-code principles
93%  Building and maintaining comprehensive monitoring, logging, and alerting systems for production infrastructure
92%  Managing and scaling Kubernetes clusters including lifecycle management, upgrades, networking, and resource orchestration
91%  Leading incident response processes, conducting root cause analysis, and driving postmortem-driven improvements
90%  Defining, implementing, and evolving Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
89%  Automating operational tasks and reducing toil through infrastructure automation and tooling development
88%  Participating in on-call rotations and maintaining production system availability during operational emergencies
88%  Designing and implementing CI/CD pipelines and deployment infrastructure for reliable application delivery
85%  Operating containerized workloads and managing container runtimes in production environments
82%  Implementing GitOps workflows and managing infrastructure through declarative configuration
80%  Implementing security best practices including identity and access management, least-privilege principles, and compliance standards
78%  Operating distributed databases and data systems at scale including configuration, performance tuning, and capacity planning
78%  Managing disaster recovery strategies and backup procedures, and implementing recovery time objectives
76%  Designing network infrastructure including load balancing, service mesh, and DNS management at scale
75%  Conducting performance analysis and reliability testing of large-scale distributed systems
70%  Optimizing infrastructure costs while maintaining reliability and performance standards
68%  Implementing data pipeline reliability and managing high-throughput ingestion systems
68%  Designing and managing multi-tenant isolation strategies for shared infrastructure platforms
$ skills --emerging

72%  Analyzing measurement data and system telemetry to support engineering decision-making
65%  Managing GPU and accelerated computing infrastructure for high-performance workloads
62%  Implementing observability for AI and machine learning workloads including training job reliability
$ skills --soft

Collaborating with software engineering teams to embed reliability principles into system design and deployment processes

85%

Writing and maintaining runbooks, operational procedures, and documentation for production systems

80%

Setting operational standards and quality expectations across engineering organizations

72%

Evaluating, negotiating, and managing vendor relationships for third-party services and migrations

65%
$03

Technology

The tools and technologies that define this role.

$ tech --language
Bash: high
Python: high
Go: moderate
Java: low
$ tech --framework
OpenTelemetry: moderate
$ tech --platform
AWS: very high
Kubernetes: very high
Linux: very high
Azure: high
Docker: high
Google Cloud Platform: high
ClickHouse: moderate
Kafka: moderate
Cloudflare Workers: low
MongoDB Atlas: low
Okta: low
Snowflake: low
$ tech --tool
Terraform: very high
Ansible: high
Grafana: high
Prometheus: high
Alertmanager: moderate
ArgoCD: moderate
Datadog: moderate
GitHub Actions: moderate
Helm: moderate
Jenkins: moderate
Loki: moderate
PagerDuty: moderate
pytest: moderate
Argo Workflows: low
Coralogix: low
Crossplane: low
Envoy: low
Falco: low
FluxCD: low
Opal: low
Sentry: low
Thanos: low
VictoriaMetrics: low
Wiz: low
$ tech --concept
eBPF: low
XDP: low
$04

Open Jobs

105 open Site Reliability Engineer jobs across 42 companies.

Gong · 2d
Senior DevOps Engineer
Tel Aviv · Engineering
Crusoe · 3d
Senior Production Engineer, Operational Excellence
San Francisco, CA - US · Engineering
Lovable · 4d
Runtime Engineer - Lovable Apps Platform Team
Stockholm · Engineering
Graphcore · 6d
Senior Systems Engineer – Performance & Reliability (Analysis)
London, UK · Engineering
Graphcore · 6d
Senior Systems Engineer – Performance & Reliability (Analysis)
Bristol, UK · Engineering
Graphcore · 6d
Senior Systems Engineer – Performance & Reliability
Gdańsk, Pomeranian Voivodeship, Poland · Engineering
Graphcore · 6d
Senior Systems Engineer – Performance & Reliability (Analysis)
Gdańsk, Pomeranian Voivodeship, Poland · Engineering
Cognition · 6d
Site Reliability Engineer
San Francisco · Engineering
Thinking Machines Lab · 1w
Site Reliability Engineer (SRE)
San Francisco · Engineering
OpenAI · 1w
Site Reliability Engineer, Infrastructure - Analytics Platform
San Francisco · Engineering
Together AI · 1w
AI infrastructure Engineer (SRE) Amsterdam
Amsterdam · Engineering
RunPod · 2w
Site Reliability Engineer
Remote, USA · Engineering
CoreWeave · 2w
Senior Production Engineer
Singapore · Engineering
fal · 2w
Software Engineer, Site Reliability
San Francisco · Engineering
Databricks · 2w
Sr Staff Production Engineer- Public Sector
Virginia · Engineering
Databricks · 2w
Staff Production Engineer- Public Sector
Virginia · Engineering
Databricks · 2w
Sr Production Engineer- Public Sector
Virginia · Engineering
CoreWeave · 2w
Production Engineer – Team Lead
Singapore · Engineering
MongoDB · 2w
Site Reliability Engineer 3
New York City · Engineering
MongoDB · 2w
Site Reliability Engineer 3
Dublin · Engineering