As a Site Reliability Engineer (SRE) at Ververica, you will design, provision, and maintain the infrastructure for Ververica’s Unified Streaming Data Platform across multiple cloud providers, including AWS, GCP, and Azure. You will improve the architecture, own implementation, and drive reliability best practices.
Key Responsibilities
Build and maintain the infrastructure for Ververica’s Unified Streaming Data Platform across AWS, GCP, and Azure.
Design and manage Infrastructure as Code (IaC) using Terraform.
Implement and enhance observability tooling, including Grafana, Prometheus, and logging systems, covering metrics, logs, traces, dashboards, and alerts.
Ensure system reliability through SRE best practices.
Improve infrastructure architecture and engineering efficiency through continuous evaluation and optimization.
Enhance CI/CD pipelines to automate development workflows.
Monitor for, identify, and remediate security vulnerabilities.
Contribute to the successful development and launch of new products, features, and services.
Participate in on-call rotations to manage incidents on live infrastructure that runs 24/7.
Maintain and update documentation.
Required Qualifications
Bachelor’s degree in Computer Science, Information Technology, or a related field.
Minimum 2 years of hands-on experience with Kubernetes clusters, Helm charts, controllers, and operators.
Proficiency in designing and maintaining Terraform code.
Strong knowledge of observability tools and practices.
Experience implementing SRE principles.
Solid understanding of Linux systems and networking in cloud environments.