We are looking for a Site Reliability Engineer (SRE) to join our Engineering team. As an SRE, you will play a crucial role in ensuring the reliability, availability, and performance of our systems and services. You will collaborate with other teams to design, implement, and maintain infrastructure and automation solutions, supporting the continuous improvement of our platform's reliability and scalability.
What you will do:
Work across teams to ensure software is developed and deployed for maximum reliability
Develop, run and improve processes and tools
Build automation to support reliability efforts for all of our production services
Join incidents, help resolve them, and assist in drafting RCAs and other documentation that is provided directly to customers
About You:
You have at least 8 years of experience working with production systems
Experienced in managing large-scale production systems
Strong proficiency in multiple programming languages such as Python, Rust or TypeScript
Hands-on experience with container technologies and tooling such as Docker, Kubernetes or Helm
Solid experience with cloud platforms such as AWS, Azure or Google Cloud
Knowledgeable about network protocols, load balancing, and DNS management
Familiar with monitoring and logging tools and best practices
Experience deploying and monitoring systems using infrastructure as code
Excellent problem-solving and troubleshooting skills