Why I Didn’t Start with Spark, Pinot, or ClickHouse


This question comes up a lot, so it’s worth answering directly:

If you’re doing analytics at scale, why not Spark, Pinot, or ClickHouse?

They’re powerful tools.

They’re also optimized for a different problem.


The Kind of Scale That Matters

There are two kinds of scale:

  1. Data volume

  2. Semantic complexity

Most modern analytics stacks optimize for the first.

This work is mostly about the second: keeping measures, hierarchies, and definitions consistent as they multiply and evolve.


Spark: Great for Pipelines, Not Reasoning

Spark excels at:

  • Batch processing

  • Large transformations

  • Schema-on-read workloads

But semantic analytics needs:

  • Low latency

  • Fine-grained validation

  • Interactive feedback

You can build that on Spark — but you’ll spend most of your time:

  • Managing jobs

  • Handling latency

  • Debugging execution graphs

It’s a mismatch for question-driven systems.


Pinot and ClickHouse: Fast, But Opinionated

Pinot and ClickHouse are impressive.

They shine when:

  • Queries are known in advance

  • Dimensions are stable

  • Aggregation paths are predictable

Semantic systems break those assumptions:

  • Users define models dynamically

  • Hierarchies evolve

  • New measures appear

You end up fighting the engine:

  • Rebuilding segments

  • Re-indexing constantly

  • Encoding semantics indirectly

Speed doesn’t help if the answer is conceptually wrong.


Semantics First, Engines Later

My bias has been:

  1. Make meaning explicit (sketched below)

  2. Make queries explainable

  3. Make performance incremental

Once semantics are stable, then:

  • Specialized engines make sense

  • Pre-computation becomes safe

  • Cost optimization becomes meaningful

Starting with a heavy engine too early locks you in.
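
To make step 1 concrete, here is a minimal sketch of a measure whose meaning lives in data rather than in engine configuration. The Measure type and explain helper are names I’m assuming for illustration, not part of any particular library:

```python
from dataclasses import dataclass

# Hypothetical semantic-layer type; the names are illustrative.
@dataclass(frozen=True)
class Measure:
    name: str
    expression: str         # the SQL aggregate that defines the measure
    grain: tuple[str, ...]  # the dimensions the measure is valid at

# "Make meaning explicit": the definition is plain data, so it can be
# validated and explained before any engine executes it.
revenue = Measure(
    name="net_revenue",
    expression="SUM(amount) - SUM(refunds)",
    grain=("region", "month"),
)

def explain(measure: Measure) -> str:
    """Render a human-readable account of what a query will compute."""
    return f"{measure.name} = {measure.expression} by {', '.join(measure.grain)}"

print(explain(revenue))  # net_revenue = SUM(amount) - SUM(refunds) by region, month
```

Because the definition is data, “make queries explainable” falls out almost for free, and nothing about it commits you to a particular engine.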


PostgreSQL + Redis as a Control Plane

This combination works well as:

  • A semantic control plane

  • A correctness baseline

  • A performance sandbox

It lets you ask “What should this system do?” before asking “How fast can it go?”

That ordering matters.
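
To make the split concrete, here is a minimal sketch of the control-plane pattern, assuming a measures table in PostgreSQL and version-scoped cache keys in Redis. The schema, connection details, and function names are all assumptions for illustration, not details from this post:

```python
import json

import psycopg2  # PostgreSQL driver
import redis     # Redis client

# PostgreSQL holds the semantic definitions (the correctness baseline);
# Redis holds computed answers keyed by definition version, so editing
# a definition naturally invalidates its cached results.
pg = psycopg2.connect("dbname=semantics")  # assumed connection details
cache = redis.Redis()

def answer(measure_name: str):
    # Look up the current definition; the table and columns are assumed.
    with pg.cursor() as cur:
        cur.execute(
            "SELECT version, sql_text FROM measures WHERE name = %s",
            (measure_name,),
        )
        version, sql_text = cur.fetchone()

    # The cache key embeds the definition version: change the meaning,
    # and the old answers simply stop being served.
    key = f"measure:{measure_name}:v{version}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # Postgres doubles as the baseline executor while semantics settle.
    with pg.cursor() as cur:
        cur.execute(sql_text)
        rows = cur.fetchall()

    cache.setex(key, 300, json.dumps(rows))  # keep the answer for 5 minutes
    return rows
```

The useful property is the ordering: PostgreSQL stays the single source of truth for what a measure means, and Redis only ever accelerates answers the baseline could produce on its own.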


This Is Not an Anti-Big-Data Argument

At some point:

  • Pinot might be perfect

  • ClickHouse might slot in cleanly

  • Spark might power offline simulation

The key is timing.

You don’t want your infrastructure to decide your semantics for you.


Final Thought

Fast answers to the wrong question aren’t useful.

Semantic systems live or die on:

  • Trust

  • Clarity

  • Explainability

For that, simpler stacks often win — at least at the beginning.

If you’ve made different trade-offs and they worked, I’d genuinely love to hear about them.

