Senior Site Reliability Engineer (SRE)
at Carousell Group · Jan 2023 - Present
As a Senior Site Reliability Engineer at Carousell Group, I have been instrumental in a large-scale initiative to overhaul a complex, high-traffic legacy environment. My role blends SRE, platform engineering, and DevOps principles to improve stability, developer experience, and cost efficiency. I have cut monthly cloud expenditure by nearly 50% while leading critical migrations to a modern, cloud-native stack.
Cloud Cost Optimization & Financial Governance
Engineered and executed a multi-faceted cost-saving strategy that has reduced monthly cloud expenditure by nearly 50%.
- Infrastructure Rightsizing: Led initiatives to migrate self-managed workloads such as Elasticsearch, Kafka, and Thumbor from standalone VMs to a more cost-effective Kubernetes environment.
- Resource Optimization: Conducted deep resource-utilization analysis across Kubernetes clusters and right-sized workload CPU and memory requests, yielding substantial savings.
- Storage Tiering & Optimization: Implemented PVC storage optimization and AWS S3 lifecycle tiering, drastically cutting storage costs (see the lifecycle sketch after this list).
- Architectural Unification: Re-architected and centralized disparate load balancers, improving security and maintainability while reducing operational costs.
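A minimal sketch of the S3 tiering approach, expressed in Terraform. The bucket reference and the transition windows are illustrative assumptions, not the production values:

```hcl
# Lifecycle rule that tiers aging objects to cheaper storage classes.
# "assets" and the 30/90-day windows are hypothetical placeholders.
resource "aws_s3_bucket_lifecycle_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id

  rule {
    id     = "tier-cold-objects"
    status = "Enabled"
    filter {} # apply to every object in the bucket

    transition {
      days          = 30
      storage_class = "STANDARD_IA" # infrequent access after 30 days
    }

    transition {
      days          = 90
      storage_class = "GLACIER" # archival after 90 days
    }
  }
}
```

Rules like this shift cost automatically as objects age, with no application changes.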
System Modernization & Kubernetes Migration
Spearheaded the migration of the entire infrastructure from legacy virtual machines to a standardized, highly available Kubernetes environment.
- Legacy Application Containerization: Containerized a large, monolithic legacy PHP application with obsolete libraries and multiple database dependencies (MySQL, PostgreSQL, MongoDB), unblocking modern development workflows.
- Stateful Service Migration: Orchestrated the complex migration of critical stateful services (Elasticsearch, Kafka, Redis, and Memcached) to a shared Kubernetes cluster, dramatically improving reliability and day-to-day operability for the SRE team (see the StatefulSet sketch after this list).
- Database Modernization: Managed and executed major version upgrades for critical PostgreSQL and MongoDB database clusters to ensure compatibility with modern infrastructure.
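A trimmed sketch of how a stateful service lands on the shared cluster, using Redis as the example and the Terraform Kubernetes provider. The names, sizes, and storage class are illustrative assumptions:

```hcl
# Hypothetical Redis StatefulSet: stable pod identity plus per-replica
# PVCs, the pattern used when moving stateful services off VMs.
resource "kubernetes_stateful_set" "redis" {
  metadata {
    name      = "redis"
    namespace = "stateful"
  }

  spec {
    service_name = "redis"
    replicas     = 3

    selector {
      match_labels = { app = "redis" }
    }

    template {
      metadata {
        labels = { app = "redis" }
      }
      spec {
        container {
          name  = "redis"
          image = "redis:7"

          # Right-sized requests keep bin-packing efficient on the shared cluster.
          resources {
            requests = {
              cpu    = "500m"
              memory = "1Gi"
            }
          }

          volume_mount {
            name       = "data"
            mount_path = "/data"
          }
        }
      }
    }

    # Each replica gets its own disk, which survives pod rescheduling.
    volume_claim_template {
      metadata {
        name = "data"
      }
      spec {
        access_modes       = ["ReadWriteOnce"]
        storage_class_name = "standard-rwo"
        resources {
          requests = { storage = "10Gi" }
        }
      }
    }
  }
}
```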
Platform Engineering & Developer Experience
Championed the adoption of a "Common Tech Foundation" (CTF) to standardize development and deployment, boosting developer velocity.
- CI/CD Architecture: Replaced insecure, manual deployment processes with automated CI/CD pipelines in Jenkins, later migrating workflows to a centralized platform built on GitHub Actions.
- Centralized Tooling: Implemented and managed centralized solutions for secrets management (HashiCorp Vault), configuration, service definition, and internal tooling, improving security and developer efficiency.
- Infrastructure as Code (IaC): Led the codification of all cloud infrastructure in Terraform and Terragrunt, establishing a version-controlled, auditable, and repeatable system for infrastructure management (see the Terragrunt sketch after this list).
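A minimal sketch of the Terragrunt layout, assuming a hypothetical module repository and GCS state bucket; the real paths and names differ:

```hcl
# Root terragrunt.hcl: one remote-state definition shared by every stack.
remote_state {
  backend = "gcs"
  config = {
    bucket = "example-terraform-state"  # hypothetical bucket
    prefix = path_relative_to_include() # one state prefix per directory
  }
}
```

```hcl
# Per-environment terragrunt.hcl: inherit the root, pin a module version.
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://github.com/example/infra-modules.git//gke-cluster?ref=v1.2.0"
}

inputs = {
  cluster_name = "shared-stateful"
  region       = "asia-southeast1"
}
```

This keeps every environment's state isolated while the module code stays in one versioned place.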
Security & Compliance Fortification
Executed a multi-pronged strategy to harden the security posture of the entire platform.
- Zero Trust Architecture: Implemented Cloudflare Zero Trust and overhauled GCP firewall rules to a default-deny policy, drastically reducing the attack surface (see the firewall sketch after this list).
- Threat Detection & Monitoring: Deployed and managed a centralized SIEM, onboarding all VMs to Wazuh for intrusion detection, and enabled GCP Security Command Center for cloud-native threat monitoring.
- Data Residency Compliance: Architected and built a dedicated infrastructure environment in Vietnam to comply with local data residency regulations, setting up a hypervisor on a physical server and implementing nested database replication for MySQL and PostgreSQL.
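A sketch of the default-deny pattern in Terraform; the network name is a placeholder, and the real allow rules are of course more numerous:

```hcl
# Catch-all deny at near-lowest priority: anything not explicitly
# allowed by a higher-priority rule is dropped.
resource "google_compute_firewall" "default_deny_ingress" {
  name      = "default-deny-ingress"
  network   = google_compute_network.main.name # hypothetical network
  direction = "INGRESS"
  priority  = 65534

  deny {
    protocol = "all"
  }

  source_ranges = ["0.0.0.0/0"]
}

# Example targeted allow: GCP load-balancer health-check ranges only.
resource "google_compute_firewall" "allow_lb_health_checks" {
  name      = "allow-lb-health-checks"
  network   = google_compute_network.main.name
  direction = "INGRESS"
  priority  = 1000

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  source_ranges = ["130.211.0.0/22", "35.191.0.0/16"]
}
```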
Advanced Observability
Architected and deployed a comprehensive, cost-effective observability platform to replace a costly SaaS solution.
- Monitoring Stack: Implemented a self-hosted stack built on VictoriaMetrics, Prometheus, and Grafana, migrating the entire platform to Kubernetes for high availability (see the Helm sketch after this list).
- Data-Driven Insights: Empowered engineers with secure, read-only access to production data by building a real-time replication pipeline from 37 production databases to Google BigQuery using GCP Datastream.
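A minimal sketch of how a self-hosted stack like this can be installed via Terraform's Helm provider; the release name, namespace, and values file are illustrative assumptions:

```hcl
# Installs the VictoriaMetrics Kubernetes stack (vmagent, vmalert,
# Grafana) as a single Helm release managed from Terraform.
resource "helm_release" "vm_stack" {
  name             = "vm-stack"
  repository       = "https://victoriametrics.github.io/helm-charts"
  chart            = "victoria-metrics-k8s-stack"
  namespace        = "monitoring"
  create_namespace = true

  # HA and retention settings live in a version-controlled values file.
  values = [file("${path.module}/values/vm-stack.yaml")]
}
```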