Observability Platform Architecture
Part of my work at Carousell
Abstract: Architected and led the migration from a costly SaaS monitoring solution to a powerful, scalable, and cost-effective self-hosted observability platform on Kubernetes.
Tech: #SRE #Observability #Prometheus #VictoriaMetrics #Grafana #Elasticsearch #Kubernetes #Cost Optimization
The Challenge
The engineering teams relied on a SaaS monitoring solution (Datadog) for observability. While functional, it presented several significant challenges as the company scaled:
- Prohibitive Costs: The subscription costs were escalating rapidly with the growing number of services and VMs.
- Limited Flexibility: The SaaS solution offered limited flexibility for custom metrics and long-term data analysis without incurring further costs.
- Lack of Ownership: We had limited control over the platform's performance, availability, and data retention policies.
The Solution
I was tasked with architecting and implementing a more flexible, cost-effective observability platform from the ground up. The result was a self-hosted stack built entirely on open-source tools.
- Metrics Platform: I designed a new monitoring stack centered around VictoriaMetrics for efficient, long-term time-series data storage, Prometheus for metrics collection across all services, and Grafana for rich, intuitive visualization and dashboards.
- Centralized Logging: For logging, I deployed and configured the ELK Stack (Elasticsearch, Logstash, Kibana), creating a centralized location for engineers to search and analyze log data from all applications and infrastructure.
- Migration to Kubernetes: To ensure the reliability of the observability platform itself, I led the strategic migration of the entire stack (VictoriaMetrics, ELK, etc.) from legacy VMs to run on a Kubernetes cluster. This provided high availability, scalability, and simplified management.
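As an illustration of how the metrics side of such a stack is typically consumed, the sketch below builds an instant-query URL for the Prometheus-compatible HTTP API (`/api/v1/query`), which both Prometheus and VictoriaMetrics expose, and parses the JSON response into a per-instance mapping. It is a minimal sketch, not the production setup: the host name `victoria-metrics.example`, the port, the `up == 0` query, and the sample payload are all assumptions made for the example.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; the real address depends on how the stack is deployed.
BASE_URL = "http://victoria-metrics.example:8428"


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for an instant query against /api/v1/query."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_instant_vector(payload: str) -> dict[str, float]:
    """Map each series' `instance` label to its sample value.

    Expects the standard Prometheus response envelope with
    resultType "vector"; raises ValueError on anything else.
    """
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")

    values = {}
    for series in data["result"]:
        _timestamp, raw_value = series["value"]  # value is [ts, "string"]
        instance = series["metric"].get("instance", "")
        values[instance] = float(raw_value)
    return values


# Illustrative response body, shaped like a real API reply (not real data).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "api-1:9100"},
             "value": [1700000000, "0"]},
        ],
    },
})

if __name__ == "__main__":
    print(build_instant_query_url(BASE_URL, "up == 0"))
    print(parse_instant_vector(SAMPLE_RESPONSE))
```

Because the query API is the same on both backends, dashboards and scripts written against it should not need to change when the storage layer moves, for example from VMs to Kubernetes.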
The Impact
This initiative changed how we monitored our systems and had a direct positive impact on the business.
- Significant Cost Reduction: The move to a self-hosted solution eliminated costly SaaS subscription fees, resulting in major monthly savings.
- Improved System Reliability: The new platform provided deeper, more granular insight into system health, enabling faster incident response and proactive problem-solving.
- Empowered Engineering Teams: Engineers were given a powerful, flexible, and unified platform to better understand their services, leading to more robust and performant applications.