Ready to embark on the quest of joining Hack The Box? At the end of this thrilling journey, you'll become a proud member of Hack The Box, with the ultimate mission to help cybersecurity professionals and organizations enhance their cyber-attack readiness. Get ready for an exciting adventure into the world of cybersecurity! 🚀🔒💻
The Core Mission of the Site Reliability Engineer (SRE): As a Site Reliability Engineer at Hack The Box, your paramount mission is to assist with the seamless migration to AWS, strategically positioning our infrastructure to scale effectively with the company. Over the next 6 months, you will help enhance our capacity for expansion, setting the stage for the addition of new systems such as Kubernetes clusters, services, and databases. Your focus will then shift towards establishing key performance indicators, service level objectives, and incident response metrics to drive a culture of reliability and continuous improvement. 🏢
The Fellowship You'll Be Joining: You'll join a team of 6 SREs and collaborate closely with engineers, data scientists, and security experts. You will report directly to the SRE Manager and have open communication with infrastructure department management and other high-caliber technical people across the organization.
Technology Tools & Weapons You'll Be Using:
Infrastructure as Code (Terraform): Automate the provisioning of AWS resources.
Containerization and Orchestration (Kubernetes, Flux CD): Ensure seamless deployment and scaling of applications.
Monitoring and Logging (Prometheus, Mimir, Grafana, Loki): Expand monitoring capabilities for new systems.
Automation and Scripting (Go, Python, etc.): Write scripts for efficient, automated processes.
Cloud Platforms (AWS): Execute the migration plan with a focus on AWS.
The Adventures That Await You After Becoming a Site Reliability Engineer at Hack The Box:
AWS Migration for Scalability: Contribute heavily to the migration from the current cloud provider to AWS, strategically positioning our infrastructure for scalable growth across regions.
Expand Monitoring Stack: Integrate new systems into the Monitoring Stack, enhancing visibility and alerting capabilities for a globally distributed architecture.
Architectural Design for Reliability: Contribute to the design and implementation of reliable AWS infrastructure, focusing on fault tolerance and high availability.
Establish Metrics Framework: Implement and manage Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) to measure and improve system reliability.
Incident Response Enhancement: Develop and enhance incident response processes, leveraging metrics to continually improve response times and effectiveness.
Mentorship: Mentor and guide junior SREs in adapting to the AWS environment and implementing reliability best practices.
Collaborative Planning: Work closely with cross-functional teams to plan and implement new systems effectively, ensuring alignment with reliability goals.
Team Expansion: Play a key role in the team's expansion, contributing to the mentoring of junior members.
Best Practices Advocacy: Champion best practices in AWS architecture and SRE methodologies, fostering a culture of reliability and continuous improvement.