
Member of Technical Staff - Infrastructure Reliability

Infrastructure | Palo Alto, CA

<div class="content-intro"><h3><strong><span style="font-family: arial, helvetica, sans-serif;">About xAI</span></strong></h3> <p><span style="font-family: arial, helvetica, sans-serif;">xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. </span><span style="font-family: arial, helvetica, sans-serif;">Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. </span><span style="font-family: arial, helvetica, sans-serif;">We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. </span><span style="font-family: arial, helvetica, sans-serif;">All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.</span></p></div><h3><strong>ABOUT THE ROLE:</strong></h3> <p>We are training some of the largest models in the world on the latest hardware across multiple environments. To do this reliably at xAI’s pace, we need engineers who have battle-tested experience keeping massive distributed infrastructure up and running 24/7, including on-prem and cloud-based infrastructure. This is a joint xAI/X role: you will own 24×7 reliability for the world’s largest GPU training superclusters and one of the highest-QPS production systems on the planet (X).</p> <p>You will own the availability, performance, and evolution of xAI’s core compute, storage, and networking infrastructure. This is not an ops-only role — strong coding is a hard requirement. 
You will design, implement, and ship systems software, automation, and tooling in Python and/or Rust that directly impact training throughput and cluster utilization. You will be expected to participate in a team on-call rotation and to help move xAI to the next generation of infrastructure management across multiple data centers and cloud environments.</p> <h3><strong>RESPONSIBILITIES:</strong></h3> <ul> <li>Define and execute the technical strategy for infrastructure reliability and scalability</li> <li>Build and maintain the automation, observability, and control planes that keep multi-datacenter, hybrid cloud/on-prem environments healthy</li> <li>Lead incident response, deep-dive root cause analysis, and post-mortems that drive real fixes</li> <li>Identify, instrument, and eliminate systemic failure patterns (capacity, network, hardware, storage, software)</li> <li>Design and implement high-leverage systems software (daemons, controllers, schedulers, etc.) in Python and Rust</li> <li>Push the state of the art in large-scale GPU cluster operations and AI workload reliability</li> </ul> <h3><strong>BASIC QUALIFICATIONS:</strong></h3> <ul> <li>5+ years shipping production software and/or operating distributed infrastructure at scale</li> <li>Expert-level knowledge of Linux systems, TCP/IP networking, and systems programming</li> <li>Strong coding skills with proven production experience in Rust (strongly preferred) and at least one of Python, Go, or C++</li> <li>Deep experience with large-scale distributed systems in on-prem and cloud environments (GCP experience a plus)</li> <li>Hands-on expertise with container orchestration (Kubernetes, Borg-class systems, or custom schedulers), container runtimes, and infrastructure-as-code (Puppet/Chef/Ansible/Terraform)</li> <li>Intimate understanding of common failure modes in distributed systems and how to mitigate them (blast radius control, failure domains, canaries, chaos engineering, etc.)</li> 
<li>Track record of participating in (or building) effective on-call rotations in high-stakes environments</li> <li>Bachelor’s degree in Computer Science, Electrical Engineering, or equivalent real-world experience</li> </ul> <h3><strong>PREFERRED SKILLS AND EXPERIENCE:</strong></h3> <ul> <li>Significant contributions to large-scale GPU clusters or AI/ML infrastructure</li> <li>Experience in on-call rotations and incident response in high-stakes environments</li> <li>Strong problem-solving skills and ability to thrive in a fast-paced, ambiguous setting</li> <li>Experience with high-performance networking (RDMA, RoCE, InfiniBand) and low-level configuration (eBPF, XDP, io_uring)</li> <li>Comfortable with deployment, support, monitoring, administration, and troubleshooting across on-prem, cloud, and hybrid infrastructures</li> </ul> <h3>COMPENSATION AND BENEFITS:</h3> <p>$180,000 - $400,000 USD</p> <p>Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and various other discounts and perks.</p><div class="content-conclusion"><p><em>xAI is an equal opportunity employer. For details on data processing, view our </em><em><a href="https://x.ai/legal/recruitment-privacy-notice" target="_blank">Recruitment Privacy Notice</a>.</em></p></div>
