Observability Platform Architecture

June 15, 2023 · Part of my work at Carousell

Abstract: Architected and led the migration from a costly SaaS monitoring solution to a powerful, scalable, and cost-effective self-hosted observability platform on Kubernetes.

Tech: #SRE #Observability #Prometheus #VictoriaMetrics #Grafana #Elasticsearch #Kubernetes #CostOptimization

The Challenge

The engineering teams relied on a SaaS monitoring solution (Datadog) for observability. While functional, it presented several significant challenges as the company scaled:

Prohibitive Costs: The subscription costs were escalating rapidly with the growing number of services and VMs.
Limited Flexibility: Custom metrics and long-term data analysis were constrained, and expanding either meant incurring further costs.
Lack of Ownership: We had limited control over the platform's performance, availability, and data retention policies.

The Solution

I was tasked with architecting and implementing a more powerful, flexible, and cost-effective observability platform from the ground up. The solution was a modern, self-hosted stack built on best-in-class open-source tools.

Metrics Platform: I designed a new monitoring stack centered around VictoriaMetrics for efficient, long-term time-series data storage, Prometheus for metrics collection across all services, and Grafana for rich, intuitive visualization and dashboards.
Centralized Logging: For logging, I deployed and configured the ELK Stack (Elasticsearch, Logstash, Kibana), creating a centralized location for engineers to search and analyze log data from all applications and infrastructure.
Migration to Kubernetes: To ensure the reliability of the observability platform itself, I led the strategic migration of the entire stack (VictoriaMetrics, ELK, etc.) from legacy VMs to run on a Kubernetes cluster. This provided high availability, scalability, and simplified management.
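
To illustrate how the metrics side of such a stack can be consumed, here is a minimal sketch of reading data back through the Prometheus-compatible query API that VictoriaMetrics exposes (the same kind of endpoint Grafana queries as a data source). It is not the production setup described above: the service address, the function names, and the example query are hypothetical, and it assumes a single-node VictoriaMetrics listening on its default port 8428.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical in-cluster address (assumption, not from the original setup).
# 8428 is the default HTTP port of single-node VictoriaMetrics.
VM_BASE_URL = "http://victoriametrics.monitoring.svc:8428"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus-compatible HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> dict:
    """Map each series' sorted label pairs to its sample value as a float."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']!r}")
    series = {}
    for item in data["result"]:
        labels = tuple(sorted(item["metric"].items()))
        _timestamp, value = item["value"]  # the API returns the value as a string
        series[labels] = float(value)
    return series


def query(promql: str, base_url: str = VM_BASE_URL, timeout: float = 5.0) -> dict:
    """Run an instant query over HTTP and return the parsed series."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling `query("up")` against a reachable instance would return one entry per scraped target, keyed by its labels. Keeping URL construction and response parsing separate from the network call makes the parsing logic easy to test without a running cluster.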

The Impact

This initiative changed how we monitored our systems and had a direct positive impact on the business.

Significant Cost Reduction: The move to a self-hosted solution eliminated costly SaaS subscription fees, resulting in major monthly savings.
Improved System Reliability: The new platform provided deeper, more granular insight into system health, enabling faster incident response and proactive problem-solving.
Empowered Engineering Teams: Engineers were given a powerful, flexible, and unified platform to better understand their services, leading to more robust and performant applications.