Member of Technical Staff - Infrastructure Reliability
Infrastructure | Palo Alto, CA
<div class="content-intro"><h3><strong><span style="font-family: arial, helvetica, sans-serif;">About xAI</span></strong></h3>
<p><span style="font-family: arial, helvetica, sans-serif;">xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. </span><span style="font-family: arial, helvetica, sans-serif;">Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. </span><span style="font-family: arial, helvetica, sans-serif;">We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. </span><span style="font-family: arial, helvetica, sans-serif;">All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.</span></p></div><h3><strong>ABOUT THE ROLE:</strong></h3>
<p>We are training some of the largest models in the world on the latest hardware across multiple environments. To do this reliably at xAI’s pace, we need engineers with battle-tested experience keeping massive distributed infrastructure, both on-prem and cloud-based, up and running 24/7. This is a joint xAI/X role: you will own 24/7 reliability for the world’s largest GPU training superclusters and one of the highest-QPS production systems on the planet (X).</p>
<p>You will own the availability, performance, and evolution of xAI’s core compute, storage, and networking infrastructure. This is not an ops-only role — strong coding is a hard requirement. You will design, implement, and ship systems software, automation, and tooling in Python and/or Rust that directly impact training throughput and cluster utilization. You will be expected to participate in a team on-call rotation and to contribute to ushering xAI into the next generation of infrastructure management across multiple data centers and cloud environments.</p>
<h3><strong>RESPONSIBILITIES:</strong></h3>
<ul>
<li>Define and execute the technical strategy for infrastructure reliability and scalability</li>
<li>Build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy</li>
<li>Lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes</li>
<li>Identify, instrument, and eliminate systemic failure patterns (capacity, network, hardware, storage, software)</li>
<li>Design and implement high-leverage systems software (daemons, controllers, schedulers, etc.) in Python and Rust</li>
<li>Push the state of the art in large-scale GPU cluster operations and AI workload reliability</li>
</ul>
<h3><strong>BASIC QUALIFICATIONS:</strong></h3>
<ul>
<li>5+ years shipping production software and/or operating distributed infrastructure at scale</li>
<li>Expert-level knowledge of Linux systems, TCP/IP networking, and systems programming</li>
<li>Strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++</li>
<li>Deep experience with large-scale distributed systems in on-prem and cloud environments (GCP experience a plus)</li>
<li>Hands-on expertise with container orchestration (Kubernetes, Borg-class systems, or custom schedulers), container runtimes, and infrastructure-as-code (Puppet/Chef/Ansible/Terraform)</li>
<li>Intimate understanding of common failure modes in distributed systems and how to mitigate them (blast radius control, failure domains, canaries, chaos engineering, etc.)</li>
<li>Track record of participating in (or building) effective on-call rotations in high-stakes environments</li>
<li>Bachelor’s degree in Computer Science, Electrical Engineering, or equivalent real-world experience</li>
</ul>
<h3><strong>PREFERRED SKILLS AND EXPERIENCE:</strong></h3>
<ul>
<li>Significant contributions to large-scale GPU clusters or AI/ML infrastructure</li>
<li>Experience in on-call rotations and incident response in high-stakes environments</li>
<li>Strong problem-solving skills and ability to thrive in a fast-paced, ambiguous setting</li>
<li>Experience with high-performance networking (RDMA, RoCE, InfiniBand) and low-level configuration (eBPF, XDP, io_uring)</li>
<li>Comfort with deployment, support, monitoring, administration, and troubleshooting across on-prem, cloud, and hybrid infrastructures</li>
</ul>
<h3><strong>COMPENSATION AND BENEFITS:</strong></h3>
<p>$180,000 - $400,000 USD</p>
<p>Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and various other discounts and perks.</p><div class="content-conclusion"><p><em>xAI is an equal opportunity employer. For details on data processing, view our </em><em><a href="https://x.ai/legal/recruitment-privacy-notice" target="_blank">Recruitment Privacy Notice</a>.</em></p></div>