Mistral AI is looking for a Site Reliability Engineer (SRE) to shape the reliability, scalability, and performance of our platform and customer-facing applications. You will work closely with our software engineers to ensure our systems meet and exceed our customers' expectations.
Responsibilities
- Make sure our inference and platform resources are always available and in good shape
- Ensure our products are reliable and meet their SLAs
- Design, build, and maintain scalable, highly available, and fault-tolerant standard and AI infrastructure to support our machine learning workloads and services
- Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime
- Develop and maintain comprehensive documentation for infrastructure designs, processes, and best practices
- Participate in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences
- Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform, …
- Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements
- Evaluate and implement new tools, technologies, and processes to enhance our AI infrastructure's efficiency, reliability, and scalability
About you:
- 5+ years of experience in software engineering
- Key technical skills: observability/alerting/operational maintenance
- Familiar with bare Kubernetes/Grafana/Prometheus
- Experience building cross-datacenter, highly available distributed systems
- Experience profiling & optimizing stacks to the millisecond
- Good programming skills in one language (Python/Go/C++/Rust)
- Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role, ideally in an AI/ML-focused environment.
- Strong understanding of AI/ML infrastructure requirements
- Experience with containerization and orchestration technologies like Docker and Kubernetes.
- Familiarity with infrastructure-as-code tools such as Terraform
- Solid understanding of cloud computing platforms like AWS, GCP, or Azure.
- Experience with monitoring, logging, and alerting tools like Prometheus, Grafana, ELK Stack, …
- Strong problem-solving skills and the ability to work independently and collaboratively in a fast-paced environment.
- Excellent communication skills, both written and verbal.
What We Offer:
- Ability to shape the exciting journey of AI and be part of the very early days of one of Europe’s hottest startups
- A fun, young, multicultural team and collaborative work environment — based in Paris and London
- Competitive salary and bonus structure
- Comprehensive benefits package
- Opportunities for professional growth and development
We're a small team composed of seasoned researchers and engineers in the AI field. We like to work hard and be at the edge of science. We are creative, low-ego, team-spirited, and have been passionate about AI for years. We hire people who thrive in competitive environments, because they find them more fun to work in. We hire passionate women and men from all over the world.
Developers are using our API via la Plateforme to build incredible AI-first applications powered by our models, which can understand and generate natural language text and code. We are multilingual at our core. More recently, we released le Chat as a demonstrator of our models.