Join AION as a Site Reliability Engineer in a fully remote role. You will play a pivotal role in improving our high-performance AI cloud platform, focusing on system reliability, automation, and scalable infrastructure. This position is ideal for a skilled software engineer who is passionate about cloud technologies.
Key Responsibilities
Design and implement comprehensive monitoring and alerting systems across all AION platforms.
Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes.
Create and maintain runbooks and playbooks for handling common operational scenarios and incidents.
Implement service mesh solutions for observability, traffic management, and security using technologies like Istio and Linkerd.
Design and implement logging systems that provide visibility into complex distributed systems.
Own capacity planning and resource optimization across cloud environments.
Implement CI/CD pipelines for reliable and consistent deployments across all environments.
Design and build self-healing systems that automatically recover from common failure modes.
Develop infrastructure for both the compute platform and data annotation services with consistent...
Required Qualifications
3-8 years of experience in Site Reliability Engineering or DevOps
Deep expertise with cloud platforms like AWS, GCP, or Azure
Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
Strong experience with Infrastructure as Code tools such as Terraform and Pulumi
Expertise in implementing comprehensive monitoring using Prometheus, Grafana, and the ELK stack
Experience with service mesh technologies like Istio and Linkerd
Understanding of network architectures, DNS, load balancing, and security groups
Knowledge of automated deployment pipelines and GitOps workflows
Proficiency in scripting with Bash, Python, or Go
Deep understanding of Docker, containerd, and OCI specifications
Knowledge of infrastructure security best practices and compliance requirements
Experience with incident response, post-mortems, and developing SOP documentation
Preferred Qualifications
Previous work experience at a FAANG company or a top startup and a Tier 1 college education are preferred but not required.
About AION
AION is revolutionizing the future of high-performance computing (HPC) with its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training and more. Our innovative Proof of Compute Contribution (PoCC) protocol and partnerships with USD-backed economies like Tether ensure a stable and efficient platform. With a global presence and strategic partnerships, AION is at the forefront of bridging the AI wealth gap.
Benefits & Perks
As part of AION, you will enjoy a competitive salary, comprehensive health benefits, flexible work hours, and the opportunity to work in a dynamic and supportive remote environment. You will also be part of a culture that values innovation and promotes continuous learning.