The engineering team at PalUp is at the core of our mission, building and maintaining systems that make our large-scale social platform stable, reliable, and efficient. As a Site Reliability Engineer, you will play a vital role in ensuring the seamless operation of our infrastructure and services, supporting millions of global users while collaborating closely with the broader engineering team to drive innovation and improve system performance.
We’re looking for engineers who value collaboration, fairness, and mutual respect, and who thrive in a dynamic and innovative environment. At PalUp, we focus on solving impactful problems, creating scalable solutions, and empowering teams to deliver world-class experiences to our users. Your expertise in system reliability, scalability, and performance optimization will be critical in shaping the future of AI-driven social interactions.
You’re a skilled and driven engineer with a strong background in site reliability or DevOps. You excel at problem-solving, enjoy automating workflows, and are passionate about designing systems that are both robust and scalable. You thrive in a collaborative environment where innovative ideas are valued, and you’re eager to make a meaningful impact.
3+ years of experience in SRE/DevOps or related roles.
Strong expertise in cloud services and infrastructure (GCP preferred; AWS or Azure is a plus).
Solid knowledge of Linux system administration and maintenance.
Proficiency in programming languages such as Python or Golang.
Hands-on experience with monitoring and alerting systems (Grafana, Prometheus).
Advanced knowledge of Kubernetes and containerization tools like Docker.
Familiarity with log management systems and operational configurations.
Experience with security threat handling and familiarity with OWASP Top 10.
A degree in computer science or a related field.
Experience with relational and non-relational databases (e.g., MySQL, PostgreSQL, MongoDB).
Strong English reading and communication skills for technical documentation.
Automation over manual processes.
Proactive problem-solving that addresses issues before they become critical.
Collaborative teamwork and open communication across teams.
A commitment to improving system reliability and user experiences.
Design, implement, and maintain monitoring and alerting systems to ensure service stability.
Maintain and optimize CI/CD pipelines to improve deployment efficiency and reliability.
Manage and improve cloud-based deployment processes using Docker, Kubernetes, and related tools.
Analyze system bottlenecks and proactively implement architectural and performance optimizations.
Collaborate with development teams to ensure high availability and fault tolerance of applications and databases.
Develop scripts and automation tools (e.g., Python, Shell scripts) to streamline operational tasks.