Site Reliability Engineer_Reliability & Platform Team

Taipei | LINE Taiwan | Engineering | Full-time

The Site Reliability Engineering (SRE) team at LINE Taiwan designs, builds, and operates the infrastructure that powers LINE’s services. We collaborate closely with engineering teams to ensure our systems are reliable, scalable, and efficient at serving millions of users. As an SRE, you will tackle complex infrastructure challenges, drive automation, and improve the overall reliability and performance of LINE's platforms.

Responsibilities:

Collaborate with service teams during the system design phase, providing guidance on architecture, capacity planning, and reliability best practices.
Enhance monitoring, alerting, and incident response processes; continuously improve observability to ensure production service stability and performance.
Lead and participate in postmortems and root cause analyses to foster a culture of learning and resilience.
Continuously improve infrastructure and operational processes to enhance service reliability, development velocity, and efficiency.
Maintain and optimize Kubernetes environments, including upgrades, scaling, resource tuning, and performance improvements.
Collaborate with internal engineering teams to troubleshoot performance and availability issues, ensuring seamless coordination between development and operations.
SLO / Error-Budget Ownership – Partner with product and engineering teams to define meaningful SLIs and SLOs, continuously monitor error-budget burn-down, and trigger reliability improvements whenever the budget is at risk.
On-Call & Incident Command – Design and maintain a 24×7 on-call rotation with clear escalation paths, and serve as Incident Commander during critical events to coordinate cross-team response, mitigation, and post-incident communication.

Qualifications:

3–5+ years of experience in Site Reliability Engineering, DevOps, or related roles, with the ability to independently design and implement infrastructure solutions.
Hands-on experience in deploying, operating, and maintaining production-grade systems and platforms.
Proficient in Kubernetes ecosystem components such as Helm, Operators, and Kustomize.
Experience managing large-scale cloud infrastructure in public (e.g. AWS, GCP, Azure) or private cloud environments.
Strong programming skills in one or more of the following: Go, Python, Shell scripting, or similar languages.
Familiar with Infrastructure as Code (IaC) tools such as Terraform.
Strong observability skills, with experience in Prometheus, Loki, Tempo, and OpenTelemetry.
Solid knowledge of Linux systems administration, including performance tuning and troubleshooting.
Understanding of distributed system design, networking fundamentals (e.g. TCP/IP, UDP), load balancing, and storage systems.
Excellent analytical and debugging skills, especially in complex, distributed environments.
Experience with performance testing, disaster recovery planning, and capacity management.

Bonus Points (Nice to Have):

Relevant Kubernetes certifications such as CKA, CKAD, or CKS.
Experience in designing and operating large-scale observability architectures, including logging, tracing, and metrics systems.
Familiarity with eBPF-based observability tools such as Beyla, Cilium, BCC, or bpftrace.
Knowledge of developing custom Terraform providers or contributing to IaC tooling.

Interview Process:

CV screening and online assessment
First interview with teammates (online)
Second interview with the hiring manager (on-site)
Final interview with the CTO (online)
