
Observability Platform Architecture

Part of my work at Carousell

Abstract: Architected and led the migration from a costly SaaS monitoring solution to a powerful, scalable, and cost-effective self-hosted observability platform on Kubernetes.

Tech: #SRE #Observability #Prometheus #VictoriaMetrics #Grafana #Elasticsearch #Kubernetes #Cost Optimization

The Challenge

The engineering teams relied on a SaaS monitoring solution (Datadog) for observability. While functional, it presented several significant challenges as the company scaled:

  • Prohibitive Costs: The subscription costs were escalating rapidly with the growing number of services and VMs.
  • Limited Flexibility: The SaaS solution offered limited flexibility for custom metrics and long-term data analysis without incurring further costs.
  • Lack of Ownership: We had limited control over the platform's performance, availability, and data retention policies.

The Solution

I was tasked with architecting and implementing a more powerful, flexible, and cost-effective observability platform from the ground up. The solution was a modern, self-hosted stack built on best-in-class open-source tools.

  1. Metrics Platform: I designed a new monitoring stack centered around VictoriaMetrics for efficient, long-term time-series data storage, Prometheus for metrics collection across all services, and Grafana for rich, intuitive visualization and dashboards.
  2. Centralized Logging: For logging, I deployed and configured the ELK Stack (Elasticsearch, Logstash, Kibana), creating a centralized location for engineers to search and analyze log data from all applications and infrastructure.
  3. Migration to Kubernetes: To ensure the reliability of the observability platform itself, I led the migration of the entire stack (VictoriaMetrics, ELK, etc.) from legacy VMs to a Kubernetes cluster. This provided high availability, scalability, and simplified management.
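
To illustrate how the metrics pieces fit together: VictoriaMetrics exposes a Prometheus-compatible query API, so Grafana or a small script can send PromQL instant queries to `/api/v1/query`. The following is a minimal sketch, not the production setup. The base URL, port, and helper names (`build_query_url`, `parse_instant_vector`, `instant_query`) are assumptions made for illustration; the real endpoints are not part of this write-up.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical address of a single-node VictoriaMetrics instance (assumption).
VM_BASE_URL = "http://victoriametrics.example.internal:8428"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus-compatible HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload: dict) -> dict:
    """Map each series' sorted label pairs to its float value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # value arrives as a string
        results[labels] = float(value)
    return results


def instant_query(promql: str, base_url: str = VM_BASE_URL, timeout: float = 5.0) -> dict:
    """Run a PromQL instant query over HTTP (needs network access to the endpoint)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling `instant_query('up{job="api"}')` against a reachable endpoint would return one entry per matching series. Keeping URL construction and response parsing separate from the network call makes both easy to check without a live cluster.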

The Impact

This initiative fundamentally transformed how we monitored our systems and had a major positive impact on the business.

  • Significant Cost Reduction: The move to a self-hosted solution eliminated costly SaaS subscription fees, resulting in major monthly savings.
  • Improved System Reliability: The new platform provided deeper, more granular insight into system health, enabling faster incident response and proactive problem-solving.
  • Empowered Engineering Teams: Engineers were given a powerful, flexible, and unified platform to better understand their services, leading to more robust and performant applications.