Senior Site Reliability Engineer (SRE)

at Carousell Group · Jan 2023 - Present

As a Senior Site Reliability Engineer at Carousell Group, I have been instrumental in a large-scale initiative to overhaul a complex, high-traffic legacy environment. My role blends SRE, Platform Engineering, and DevOps principles to enhance stability, developer experience, and cost efficiency. I have substantially reduced monthly cloud expenditure, on a trajectory toward a 50% reduction, while leading critical migrations to a modern, cloud-native stack.

Cloud Cost Optimization & Financial Governance

Engineered and executed a multi-faceted cost-saving strategy that significantly reduced monthly cloud expenditure, on a trajectory toward a 50% reduction.

  • Infrastructure Rightsizing: Led initiatives to migrate self-managed workloads such as Elasticsearch, Kafka, and Thumbor from VMs to a more cost-effective Kubernetes environment.
  • Resource Optimization: Conducted deep resource utilization analysis across Kubernetes clusters and optimized workload requests, resulting in substantial cost savings.
  • Storage Tiering & Optimization: Implemented PVC storage optimization and AWS S3 tiering, drastically cutting storage costs.
  • Architectural Unification: Re-architected and centralized disparate load balancers, improving security and maintainability while reducing operational costs.

System Modernization & Kubernetes Migration

Spearheaded the migration of the entire infrastructure from legacy virtual machines to a standardized, highly available Kubernetes environment.

  • Legacy Application Containerization: Successfully containerized a large, monolithic legacy PHP application with obsolete libraries and multiple database dependencies (MySQL, PostgreSQL, MongoDB), unblocking modern development workflows.
  • Stateful Service Migration: Orchestrated the complex migration of critical stateful services (including Elasticsearch, Kafka, Redis, and Memcached) to a shared Kubernetes cluster, dramatically improving reliability and simplifying day-to-day management for the SRE team.
  • Database Modernization: Managed and executed major version upgrades for critical PostgreSQL and MongoDB database clusters to ensure compatibility with modern infrastructure.

Platform Engineering & Developer Experience

Championed the adoption of a "Common Tech Foundation" (CTF) to standardize development and deployment, boosting developer velocity.

  • CI/CD Architecture: Replaced insecure, manual deployment processes with automated CI/CD pipelines on Jenkins, then migrated those workflows to a centralized platform on GitHub Actions.
  • Centralized Tooling: Implemented and managed centralized solutions for secrets management (HashiCorp Vault), configuration, service definition, and various internal tooling, improving security and developer efficiency.
  • Infrastructure as Code (IaC): Led the codification of all cloud infrastructure using Terraform and Terragrunt, establishing a version-controlled, auditable, and repeatable system for infrastructure management.

Security & Compliance Fortification

Executed a multi-pronged strategy to harden the security posture of the entire platform.

  • Zero Trust Architecture: Implemented Cloudflare Zero Trust and overhauled GCP firewall rules to a "default-deny" policy, drastically reducing the attack surface.
  • Threat Detection & Monitoring: Deployed and managed Wazuh as a centralized SIEM, onboarding all VMs for intrusion detection, and configured GCP Security Command Center.
  • Data Residency Compliance: Architected and built a dedicated infrastructure environment in Vietnam, including setting up a hypervisor on a physical server and implementing nested database replication (MySQL & PostgreSQL) to comply with local data residency regulations.

Advanced Observability

Architected and deployed a comprehensive, cost-effective observability platform to replace a costly SaaS solution.

  • Monitoring Stack: Implemented a self-hosted stack using VictoriaMetrics, Prometheus, and Grafana, migrating the entire platform to Kubernetes for high availability.
  • Data-Driven Insights: Empowered engineers with secure, read-only access to production data by building a real-time replication pipeline from 37 production databases to Google BigQuery using GCP Datastream.