Why I Didn’t Start with Spark, Pinot, or ClickHouse
This question comes up a lot, so it’s worth answering directly:
If you’re doing analytics at scale, why not Spark, Pinot, or ClickHouse?
They’re powerful tools.
They’re also optimized for a different problem.
The Kind of Scale That Matters
There are two kinds of scale:
-
Data volume
-
Semantic complexity
Most modern analytics stacks optimize for the first.
This work is mostly about the second.
Spark: Great for Pipelines, Not Reasoning
Spark excels at:
-
Batch processing
-
Large transformations
-
Schema-on-read workloads
But semantic analytics needs:
-
Low latency
-
Fine-grained validation
-
Interactive feedback
You can build that on Spark — but you’ll spend most of your time:
-
Managing jobs
-
Handling latency
-
Debugging execution graphs
It’s a mismatch for question-driven systems.
Pinot and ClickHouse: Fast, But Opinionated
Pinot and ClickHouse are impressive.
They shine when:
-
Queries are known in advance
-
Dimensions are stable
-
Aggregation paths are predictable
Semantic systems break those assumptions:
-
Users define models dynamically
-
Hierarchies evolve
-
New measures appear
You end up fighting the engine:
-
Rebuilding segments
-
Re-indexing constantly
-
Encoding semantics indirectly
Speed doesn’t help if the answer is conceptually wrong.
Semantics First, Engines Later
My bias has been:
-
Make meaning explicit
-
Make queries explainable
-
Make performance incremental
Once semantics are stable, then:
-
Specialized engines make sense
-
Pre-computation becomes safe
-
Cost optimization becomes meaningful
Starting with a heavy engine too early locks you in.
PostgreSQL + Redis as a Control Plane
This combination works well as:
-
A semantic control plane
-
A correctness baseline
-
A performance sandbox
It lets you ask:
-
“What should this system do?”
Before: -
“How fast can it go?”
That ordering matters.
This Is Not an Anti-Big-Data Argument
At some point:
-
Pinot might be perfect
-
ClickHouse might slot in cleanly
-
Spark might power offline simulation
The key is timing.
You don’t want your infrastructure to decide your semantics for you.
Final Thought
Fast answers to the wrong question aren’t useful.
Semantic systems live or die on:
-
Trust
-
Clarity
-
Explainability
For that, simpler stacks often win — at least at the beginning.
If you’ve made different trade-offs and they worked, I’d genuinely love to hear about them.

Comments
Post a Comment