As a Site Reliability (Cloud Engineer)-I, you will help run a critical healthcare application. Our services are offered as Managed Service/SaaS, so total ownership of the solution, including securing it and keeping it up and running at all times, remains with us. Keeping the application available 24x7, 365 days a year is critical, and the stakes are high. We are seeking an experienced engineering team member who brings a combination of Ops/Support and Site Reliability experience and, most importantly, a “can-do” attitude and a strong sense of ownership.
A Day in the Life
In this role you will be responsible for various pillars of SRE: Deployment, Reliability, Scalability, Service Availability (SLA/SLO/SLI), Performance, Cost, etc.
Lead production rollout of new releases and emergency patches using CI/CD pipelines, and continuously improve those pipelines. Establish a solid production promotion/change management process with a solid quality gate, working across Dev/QA teams
Roll out a solid observability stack across the components of the tech stack to proactively detect outages and service degradation, and tell the two apart, before the customer notifies us
Apply strong analytical skills to understand production system metrics, drive change, optimize system utilization and drive cost efficiency
Autoscale the platform up and down to handle peak-season load
Understand the end-to-end platform architecture and how to perform triage/RCA quickly and effectively using data points from the observability tool chain
Work toward reducing the number of alerts and escalations to the next-level teams (Dev/DevOps)
Be part of the 24x7 on-call Production Support team
Lead the monthly operations review with the executive team. Topics include, but are not limited to: platform/application/infrastructure KPIs such as uptime, RCA, CAP (Corrective Action Plan) and PAP (Preventive Action Plan), security reports, and audit reports
Operate and manage production and staging cloud platforms, covering both Ops (executing and automating runbooks/SOPs, maintaining uptime/SLA) and Site Reliability Engineering
Collaboration is key to this role: work across a spectrum of teams (Dev, DevOps, QA, Customer Success, etc.) to derive RCA/5 Whys analysis and drive product improvements
Ensure that the platform is secured as per guidelines established by the CISO, e.g. protect against DDoS attacks by implementing a WAF, perform vulnerability and patch management, install required security agents, etc.
Lead least-privilege-based RBAC for various production services and the tool chain
Build and execute the Disaster Recovery plan
Participate as a key stakeholder in IR (Incident Response)
What You Need
Technical
Solid experience with at least one cloud (AWS, Azure, or GCP), with an automation focus, is a MUST. Certification is an advantage
Experience building reliable, scalable, and performant systems in production. This requires significant engineering experience and risk evaluation
Log/Metrics/Tracing tool chain experience is a MUST, along with strong analytical skills to interpret various data points for understanding platform behaviour and RCA
Hands-on experience with Kubernetes along with Linux is a MUST
Programming experience with scripting languages, e.g. Python, is a MUST
Must be good at documenting and structuring documents, be it process or RCA
Experience working in a 24x7 Production environment with process focus is preferred
Ticketing system, Incident management experience is preferred
Security background and security first approach mindset is preferred
Experience with CI/CD pipelines and tool chains is preferred
Hands-on experience with a few of these - Kafka, Postgres, Snowflake etc. is preferred
Personality Traits
Must be able to perform with a cool head under pressure, without taking any shortcuts during production issues
Collaboration and solid communication skills are critical to this role
Possesses excellent verbal and written communication skills and the ability to interact professionally with a diverse group of developers, product owners, and subject matter experts
Strong cross-functional collaboration skills, relationship building skills, and ability to achieve results without direct reporting relationships
Ability to quickly identify and drive to the optimal solution when presented with a series of constraints
Excellent judgment, analytical thinking, and problem-solving skills
Self-motivated individual who possesses excellent time management and organizational skills
Strong sense of personal responsibility and accountability for delivering high quality work
Observability tools: Elasticsearch, Prometheus, Grafana, Jaeger, New Relic, etc.
What We Offer
Industry-Focused Certifications: Meet leading healthcare experts, discuss innovative strategies, and become a subject matter expert with our comprehensive set of certifications
Rewards and Recognition: Feeling like you’re outperforming on your projects? Get recognition for your dedicated efforts and demonstrated work ethic
Health Insurance and Mental Well-being: We offer health benefits and insurance to you and your family for hospital-related expenses pertaining to any illness, disease, or injury. We also have Employee Assistance Programs (EAPs) to give you 24x7 access to certified therapists and psychologists
Sabbatical Leave Policy: Do you want to focus on skill development, pursue an academic career, or just reset? We’ve got you covered
Open Floor Plan: Cubicles are a thing of the past. To modernize our office space, we have open-floor seating at every office location. Share ideas with your peers and bond better in an open-floor office where there are no barriers and you are inspired to be creative
Paternity and Maternity Leave: Enjoy the industry’s best parental leave policy to welcome your bundle of joy and enjoy quality time with them
Paid Time Off: Maintain a healthy work–life balance and take time off from work to focus on your well-being and big life moments
Please let Innovaccer know that you found this role at devopsprojectshq.com as a way to support us, so we can keep providing you with awesome DevOps jobs.