Join 1,800+ DevOps engineers getting weekly alerts for remote and US, EU roles that don't show up on the big boards. Junior to senior. Kubernetes, AWS, Terraform — filtered for your stack.
🇪🇺 Secure Your EU Traffic
Ensure digital sovereignty for your infrastructure. Get EU static IPs with
full data residency for compliance and peace of mind.
Research Services👥 201 employees📍 San Francisco, CA, USEst. 2015
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. AI is an extremely powerful tool that must be created with…
Job Overview
OpenAI is seeking a Site Reliability Engineer to join the Frontier Systems Infrastructure team. This role involves operating the next generation of compute clusters that power OpenAI’s frontier research, blending distributed systems engineering with hands-on infrastructure work on our largest datacenters.
Key Responsibilities
Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
Improve operational metrics such as reducing cluster restart times and accelerating firmware or OS upgrade cycles
Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
Required Qualifications
Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
Bonus: background with GPU workloads, firmware management, or high-performance computing
About OpenAI
OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products.
Benefits
Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year
Mental health and wellness support
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible
Relocation support for eligible employees
Additional taxable fringe benefits, such as charitable donation matching and wellness stipends