Site Reliability Engineer
Taipei | LINE Taiwan | Engineering | Full-time
The Site Reliability Engineering (SRE) team at LINE Taiwan designs, builds, and operates the infrastructure that powers LINE’s services. We collaborate closely with engineering teams to ensure our systems are reliable, scalable, and efficient at serving millions of users. As an SRE, you will tackle complex infrastructure challenges, drive automation, and improve the overall reliability and performance of LINE's platforms.
Responsibilities:
- Collaborate with service teams during the system design phase, providing guidance on architecture, capacity planning, and reliability best practices.
- Enhance monitoring, alerting, and incident response processes; continuously improve observability to ensure production service stability and performance.
- Automate operational tasks through tools and systems to drive sustainable scalability and eliminate repetitive work.
- Lead and participate in postmortems and root cause analyses to foster a culture of learning and resilience.
- Continuously improve infrastructure and operational processes to enhance service reliability, development velocity, and efficiency.
- Operate and evolve internal platforms (e.g. private cloud, observability systems, CI/CD) to support the needs of engineering teams across the organization.
- Maintain and optimize Kubernetes environments, including upgrades, scaling, resource tuning, and performance improvements.
- Collaborate with internal engineering teams to troubleshoot performance and availability issues, ensuring seamless coordination between development and operations.
Qualifications:
- 3–5+ years of experience in Site Reliability Engineering, DevOps, or related roles, with the ability to independently design and implement infrastructure solutions.
- Hands-on experience in deploying, operating, and maintaining production-grade systems and platforms.
- Proficient in Kubernetes ecosystem components such as Helm, Operators, and Kustomize.
- Experience managing large-scale cloud infrastructure in public (e.g. AWS, GCP, Azure) or private cloud environments.
- Strong programming skills in one or more languages such as Go, Python, or shell scripting.
- Familiar with Infrastructure as Code (IaC) tools such as Terraform.
- Strong observability skills, with experience in Prometheus, Loki, Tempo, and OpenTelemetry.
- Hands-on experience with DevOps practices, including:
  - CI/CD pipelines (e.g. GitHub Actions, Jenkins)
  - GitOps-based deployment and cluster management (e.g. ArgoCD)
- Solid knowledge of Linux systems administration, including performance tuning and troubleshooting.
- Understanding of distributed system design, networking fundamentals (e.g. TCP/IP, UDP), load balancing, and storage systems.
- Excellent analytical and debugging skills, especially in complex, distributed environments.
- Experience with performance testing, disaster recovery planning, and capacity management.
Bonus Points (Nice to Have):
- Relevant Kubernetes certifications such as CKA, CKAD, or CKS.
- Experience in designing and operating large-scale observability architectures, including logging, tracing, and metrics systems.
- Familiarity with eBPF-based observability tools such as Beyla, Cilium, BCC, or bpftrace.
- Knowledge of developing custom Terraform providers or contributing to IaC tooling.
Interview Process:
- CV screening and online assessment
- First interview with team members (online)
- Second interview with the hiring manager (on-site)
- Final interview with the CTO (online)