Scaling LogHouse: An Evolution in Observability and Data Management

Introduction

About a year ago, we shared the story of LogHouse - our internal logging platform built to monitor ClickHouse Cloud. At the time, it managed what felt like a massive 19 PiB of data.

More than just solving our observability challenges, LogHouse also saved us millions by replacing an increasingly unsustainable Datadog bill. The response to that post was overwhelming. It was clear our experience resonated with others facing similar struggles with traditional observability vendors and underscored just how critical effective data management is at scale.

A year later, LogHouse has grown beyond anything we anticipated and is now storing over 100 petabytes of uncompressed data across nearly 500 trillion rows. That kind of scale forced a series of architectural changes, new tools, and hard-earned lessons that we felt were worth sharing - not least that OpenTelemetry (OTel) isn't always a panacea for observability (though we still love it), and that sometimes custom pipelines are essential.

Beyond General Purpose: Evolving Observability at Scale

Over the past year, our approach to observability has changed significantly. We've continued to use OpenTelemetry to gather general-purpose logs, but as our systems scaled, we began to reach its limits. While OTel remains a valuable part of our toolkit, it couldn't fully deliver the performance and precision we needed for our most demanding workloads.

This prompted us to develop purpose-built tools tailored to our critical systems and rethink where generic solutions truly fit. Along the way, we've broadened the range of data we collect and revamped how we present insights to engineers. This continual learning process has taught us that adaptability and discipline are essential in successfully managing observability at scale.

While our total volume has grown more than 5x, the breakdown by source reveals a deliberate shift in strategy: today, the vast majority of our data comes from "SysEx", a new purpose-built exporter developed to handle high-throughput, high-fidelity system logs from ClickHouse itself. This shift marks a turning point in our approach to observability pipelines and shows that growth in data volume can coexist with efficiency.


The Evolution of Data Handling: OpenTelemetry's Efficiency Challenges

Initially, we used OpenTelemetry (OTel) for all log collection. It was a great starting point and an established industry standard that allowed us to quickly create a baseline where every pod in our Kubernetes environment shipped logs to ClickHouse. However, as we scaled, we identified two key reasons to build a specialized tool for shipping our core ClickHouse server telemetry.

First, while OTel capably captured the ClickHouse text log via stdout, this represents only a narrow slice of the telemetry ClickHouse exposes. Any ClickHouse expert knows that the real gold lies in its system tables - a rich, structured collection of logs, metrics, and operational insights that far exceeds what is printed to standard output.
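To make the contrast with stdout concrete, here is a minimal sketch of reading structured telemetry from a system table. `system.query_log` and the columns selected are standard ClickHouse; the helper name `build_query_log_sql` and the particular choice of columns are assumptions made for illustration and are not described in the original post.

```python
from datetime import datetime, timezone


def build_query_log_sql(since: datetime, limit: int = 100) -> str:
    """Return SQL that reads finished queries from system.query_log.

    Hypothetical helper for illustration: it only builds the statement
    and does not connect to a server. `since` must be timezone-aware.
    """
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Render the lower bound as a UTC 'YYYY-MM-DD hh:mm:ss' literal.
    since_literal = since.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        "SELECT event_time, query_duration_ms, read_rows, memory_usage, query "
        "FROM system.query_log "
        "WHERE type = 'QueryFinish' "
        f"AND event_time >= toDateTime('{since_literal}', 'UTC') "
        "ORDER BY event_time "
        f"LIMIT {limit}"
    )
```

Each row returned by such a query carries typed columns (durations, row counts, memory usage) that a line of text on stdout would only hold as unstructured text, if at all.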

Second, the OTel pipeline proved inefficient for this specific task as we scaled. The path from ClickHouse text logs to OTel format and back into ClickHouse introduced substantial computational overhead and risked data loss. This pushed us to build SysEx, a specialized tool designed to transfer data from one ClickHouse instance to another efficiently, preserving full fidelity while conserving compute resources.
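The post does not describe how SysEx works internally, so the following is only a hypothetical sketch of one way an instance-to-instance exporter could plan its reads: split a time range into contiguous, non-overlapping windows and select each window from a system log table with its columns intact, with no text formatting step in between. The function names, the windowing strategy, and the assumption that the table has an `event_time` column (as `system.query_log` does) are all illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def scrape_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [lower, upper) windows that cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


def window_select(table: str, lower: datetime, upper: datetime) -> str:
    """Build a SELECT for one window of a system log table.

    Assumes timezone-aware bounds on whole seconds and a table that has
    an event_time column; the statement is built but never executed here.
    """
    if not table.isidentifier():
        raise ValueError("table must be a plain identifier")
    fmt = "%Y-%m-%d %H:%M:%S"
    lo = lower.astimezone(timezone.utc).strftime(fmt)
    hi = upper.astimezone(timezone.utc).strftime(fmt)
    return (
        f"SELECT * FROM system.{table} "
        f"WHERE event_time >= toDateTime('{lo}', 'UTC') "
        f"AND event_time < toDateTime('{hi}', 'UTC')"
    )
```

Because the windows are half-open and contiguous, each row falls into exactly one window, which is one simple way to avoid both gaps and duplicates when copying rows between instances.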

Conclusion

Over the past year, LogHouse has evolved from an ambitious logging system into a foundational observability platform powering everything from performance analysis to real-time debugging across ClickHouse Cloud. What began as a cost-saving measure has become a catalyst for both cultural and technical transformation, shifting us toward high-fidelity, wide-event telemetry at massive scale.

By combining specialized tools like SysEx with general-purpose frameworks such as OpenTelemetry, and layering on flexible interfaces like HyperDX, we have constructed a system that not only keeps pace with our growth but also unlocks entirely new workflows. The journey is far from over, but the lessons from scaling to 100 PB and 500 trillion rows continue to shape our view of observability as, fundamentally, a data challenge.

Questions and Answers

What is LogHouse and why is it important?


LogHouse is an internal logging platform built to monitor ClickHouse Cloud. It matters because it provides observability that scales with the service while keeping data management costs under control.

How has LogHouse evolved over the past year?

LogHouse has expanded from managing 19 PiB to over 100 PB of data, necessitating architectural advancements and a shift toward purpose-built tools.

What role does OpenTelemetry play in LogHouse?

OpenTelemetry serves as a general-purpose tool for gathering logs but has limitations that led to the development of more specialized solutions tailored to high-demand workloads.

What is SysEx and its significance in data management?

SysEx is a specialized exporter designed to transfer data efficiently between ClickHouse instances, preserving data fidelity and reducing CPU usage.

How does LogHouse aim for zero-impact scraping?

LogHouse is exploring methods to eliminate in-cluster query execution entirely through zero-impact scraping, minimizing operational impact while maintaining high data fidelity.

Tags: observability, ClickHouse, SysEx, OpenTelemetry, data management
