Purpose
UneeQ is the global standard for digital humans, enabling creators and brands to bring impactful interaction into our digital world. We are seeking a Senior Site Reliability Engineer to work as part of the SRE team to ensure our platform is scalable, resilient, and reliable by:
- Providing an infrastructure platform to the development team that handles scaling, failover, and monitoring.
- Managing the security and availability of that infrastructure platform.
- Recommending infrastructure solutions for business and customer requirements in a way that balances business needs with engineering and architecture best practices.
- Collaborating across multiple internal teams to design and develop infrastructure solutions supporting our business and technical strategies.
- Facilitating change management via automated CI/CD pipelines.
- Providing tooling to the development team to create a build and release pipeline that adheres to change management best practices.
- Maintaining incident management processes, including on-call rosters and post-mortems.
- Ensuring all our applications and services have measurable SLOs, and developing observability tools, frameworks, and processes.
- Monitoring infrastructure costs against an agreed budget, and working with technical and business stakeholders to align expectations and address discrepancies.
- Working with the development team to expose metrics and open up monitoring opportunities.
This role is based in New Zealand and reports to the Lead SRE. UneeQ is a remote-first workplace, meaning you will mostly work from home.
What you’re trying to achieve
- Increase development team efficiency by providing infrastructure and tooling that makes their lives easier.
- Ensure that UneeQ meets or exceeds availability, performance, and security SLAs.
- Ensure that our processes (security, change management, incident management, etc.) adhere to best practices.
- Ensure we can report to stakeholders about our system performance and availability.
- Use vendors to save time and effort while keeping track of the infrastructure spend.
- Maintain a culture and habit of continuous improvement.
How we’ll measure success
The primary qualitative metric is the perception of adding value to the rest of the team, which is assessed regularly via peer feedback. Performance against quantitative SLAs includes:
- Availability
- Average time to respond
- Average time to repair
- Infrastructure spend against budget
General competencies that will help you
- Empathy for peers and stakeholders
- Attention to detail
- Systems thinking
- Process maturity building
- Hunger for learning
- Desire to be of service to others
- Can-do attitude
- Healthy skepticism