Experience with public cloud infrastructure (e.g., AWS, Azure) and related technologies (e.g., Docker, Kubernetes, CloudFormation);
Good understanding of storage and database systems, caching and queueing, networking;
Experience leading technical recoveries;
Working knowledge of Service Management practices (ITIL);
Experience designing, analyzing, and troubleshooting distributed systems;
Ability to debug, optimize code, and automate routine operational tasks;
Solid foundation in Linux or Windows administration and troubleshooting;
Experience with monitoring and observability technologies such as Prometheus, Grafana, Kibana, and Elasticsearch is a plus;
Understanding of service level agreements (SLAs) and service level objectives (SLOs);
Excellent command of the English language, both written and spoken;
Solid understanding of programming principles and good command of at least one programming language relevant for infrastructure work;
What we offer
Direct cooperation with an already successful, long-term, and growing project;
Truly competitive salary;
Working with top-notch equipment;
Help and support from our caring HR team;
Responsibilities
Design, develop and implement systems software that improves the stability, scalability, availability and robustness of Odido’s products and services -- now and for years to come;
Develop patterns for automation, instrumentation, and similar concerns that can be reused across teams and products;
Take ownership of several services and products;
Automate instead of fixing operational issues manually;
Develop and implement strategies for effective and proactive monitoring and observability of our systems;
Provide senior technical leadership on Major Incident calls. Take technical ownership of service outage recoveries. Drive internal and partner resources to rapidly restore service, implementing best-practice technical fixes and workarounds. Use technical expertise to shape and implement recovery plans;
Manage cross-functional technical resources following Major Incidents to ensure the root cause is fully understood and documented, and that robust service protection measures are in place. Provide technical expertise at Incident Wash-ups, ensuring that all appropriate actions are in place to prevent repeat Incidents and to improve recovery times;
Triage and fix system issues in a complicated distributed landscape;
Participate in an on-call rotation, including weekend or after-hours coverage;
Oversee and continuously improve incident-response processes at Odido;
Advocate engineering best practices across the company, and mentor more junior engineers on automation and operational best practices;
Contribute to Odido's growth through interviewing and onboarding;