We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.
As a Site Reliability Engineer (SRE) at Genmo, you will be responsible for designing, implementing, and maintaining the infrastructure that powers our large generative AI models. You will work on infrastructure automation and distributed systems design, and you will manage high-performance computing (HPC) and GPU clusters. The ideal candidate will have a strong background in infrastructure automation and distributed systems, along with experience in GPU and HPC environments.
Design, implement, and maintain scalable infrastructure to support our generative AI models.
Develop and maintain infrastructure automation tools using technologies like Docker, Kubernetes, and Terraform.
Ensure the reliability, availability, and performance of our systems through proactive monitoring and incident response.
Collaborate with software engineers and researchers to design and implement distributed systems.
Manage and optimize GPU and HPC clusters for efficient AI model training and inference.
Develop and maintain CI/CD pipelines to streamline development and deployment processes.
Implement and maintain security best practices across the infrastructure.
5+ years of experience in site reliability engineering or a similar role.
Experience working in a 24x7 enterprise environment.
Hands-on experience with infrastructure as code and automation tools (Ansible, Chef, Puppet, Terraform).
Strong experience with containerization, orchestration, and provisioning tools such as Docker, Kubernetes, and Terraform.
Expertise in designing and maintaining distributed systems.
Proficiency in scripting and programming languages, particularly Python and C++.
Strong understanding of networking, security, and system performance.
Excellent problem-solving skills and the ability to work in a fast-paced environment.
Experience with cloud providers like AWS, GCP, or Azure.
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
Familiarity with CI/CD tools and practices (e.g., Jenkins, GitLab CI/CD).
Experience working with AI and machine learning models.
Strong passion for artificial intelligence and the drive to learn new technologies.
Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.