We are looking for a Senior Site Reliability Engineer with cloud platform experience. You will be part of a team responsible for operating and maintaining production clusters and developing our observability solutions, and you will collaborate with team members to develop automation strategies, build monitoring and alerting, and ensure overall platform reliability. Your goal will be to become an integral part of the team, treating every platform challenge as your own and solving it accordingly.
Key Responsibilities
Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
Act as first responder for incidents and contribute to problem management and root cause analysis.
Support the development team's reliability efforts and help build a solid reliability culture within the development lifecycle.
Develop troubleshooting documentation for production support resources.
Collaborate with Engineering teams to develop effective runbooks and operational documentation, and to automate operational tasks.
Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
Participate in on-call rotations and continuously improve alert quality and response processes.
Champion a culture of reliability, performance, and continuous improvement across teams.
Required Qualifications
Bachelor's Degree or MS in Engineering or equivalent.
Experience operating at least one container orchestration platform (e.g., Kubernetes, Docker Swarm).
Experience developing or maintaining software for production services at scale.
Experience with ELK.
Experience with AWS.
Experience with Grafana/Prometheus stack.
Strong scripting skills (Bash, Python or Go).
Excellent communication skills.
Ability to think outside the box and anticipate challenges.
About Omilia Natural Language Solutions UA Ltd
Omilia is proud to be an equal opportunity employer and is dedicated to fostering a diverse and inclusive workplace. We believe that embracing diversity in all its forms enriches our workplace and drives our collective success. We are committed to creating an environment where everyone feels welcomed, valued, and empowered to contribute their unique perspectives. All eligible candidates will be given consideration for employment without regard to factors such as race, color, religion, gender, gender identity or expression, sexual orientation, national origin, heredity, disability, age, or veteran status.
Benefits
Fixed compensation;
Long-term employment with vacation counted in working days;
Professional development opportunities (courses, training, etc.);
Being part of successful, cutting-edge technology products that are making a global impact in the service industry.