
I'm new to Druid. I've already read "Druid vs. Elasticsearch", but I still don't know what Druid is good at.

Below is my problem:

  1. I have a solr cluster with 70 nodes.

  2. I have a very big table in solr which has 1 billion rows, and each row has 100 fields.

  3. Users query with different combinations of range filters on those fields (at least 20 combinations per query) to count the distinct number of customer IDs. Solr's distinct-count algorithm is slow and memory-hungry, so when a query matches more than 200 thousand rows, the Solr query node crashes.

Does Druid have better performance than Solr for distinct counts?

zhouxiang
  • Solr 5.2 can use HyperLogLog for cardinality count: https://lucidworks.com/blog/2015/05/26/hyperloglog-field-value-cardinality-stats/ – Toke Eskildsen Oct 28 '16 at 11:44
  • Elasticsearch is based on Lucene, so you are comparing 3 different frameworks here. You can update the title or description accordingly. – avp Jul 10 '19 at 06:06

3 Answers


Druid is vastly different from search-oriented databases like ES/Solr. It is a database designed for analytics, where you can do rollups, column filtering, probabilistic computations, etc.

Druid does count-distinct through HyperLogLog, which is a probabilistic data structure. So if you don't need 100% accuracy, you can definitely try Druid; I have seen drastic improvements in response times in one of my projects. But if you care about exact accuracy, then Druid might not be the best solution (exact counts are possible in Druid as well, at the cost of performance and extra space) - see more here: https://groups.google.com/forum/#!topic/druid-development/AMSOVGx5PhQ
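To make the accuracy/space trade-off concrete, here is a toy HyperLogLog sketch in plain Python. This is a simplified illustration of the data structure, not Druid's actual implementation: a few KB of fixed-size registers approximate a distinct count to within roughly 1%.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch -- a simplified illustration, not Druid's code."""

    def __init__(self, p=14):
        self.p = p                 # 2**p registers (~16 KB here)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        h = int(hashlib.sha1(str(value).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)     # low p bits choose a register
        w = h >> self.p            # remaining bits determine the rank
        rank = 1                   # 1 + number of trailing zero bits
        while w & 1 == 0 and rank < 64:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:   # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"customer-{i}")     # 100,000 distinct ids; duplicates would be absorbed
print(hll.count())               # roughly 100,000 (~1% typical error)
```

The key point for the question: memory stays constant no matter how many rows match, which is exactly what the crashing Solr nodes lack with an exact count.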

  • FYI: ES supports `cardinality` aggregation, which is based on the HyperLogLog++ algorithm. – Kenji Noguchi Oct 12 '16 at 03:01
  • Oh nice! I didn't know about that. Do you have any metrics on how much the index size is reduced if you use `cardinality` aggregation? – Ramkumar Venkataraman Oct 12 '16 at 04:51
  • `cardinality` is a query-time aggregation. I don't think it helps in terms of index size. – Kenji Noguchi Oct 14 '16 at 00:42
  • Ah, got it! Druid supports both index-time and query-time aggregation based on HLL. The index-time aggregator significantly reduces the storage size for high-cardinality dimensions. – Ramkumar Venkataraman Oct 14 '16 at 06:41
  • Druid also supports the DataSketches extension (see https://druid.apache.org/docs/latest/development/extensions-core/datasketches-extension.html); the Theta sketch approximate distinct count works well. Nielsen Marketing Cloud has had success with it: https://www.slideshare.net/ItaiYaffe/our-journey-with-druid-from-initial-research-to-full-production-scale/20 – Davos Oct 03 '19 at 06:20

ES typically needs the raw data because it's designed for search. That means the index is huge, and nested aggregations are expensive. (I know I'm skipping a lot of details here.)

Druid is designed for metric calculation over time-series data. It makes a clear distinction between dimensions and metrics. The metric fields are pre-aggregated at ingestion time, grouped by the dimension fields. This step can reduce the data volume enormously, depending on the cardinality of the dimensional data. In other words, Druid works best when the dimensions are categorical values.
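A rough sketch of that rollup step in plain Python (the event data is made up, and a `set` stands in for the HLL sketch Druid would actually store per row):

```python
from collections import defaultdict

# Hypothetical raw events: (hour, country, customer_id)
raw_events = [
    ("2016-10-12T01", "US", "c1"),
    ("2016-10-12T01", "US", "c2"),
    ("2016-10-12T01", "US", "c1"),
    ("2016-10-12T01", "JP", "c3"),
    ("2016-10-12T02", "US", "c1"),
]

# At ingestion, Druid keeps one row per (time bucket, dimension values)
# and pre-aggregates the metrics for that row.
rollup = defaultdict(lambda: {"count": 0, "uniques": set()})
for hour, country, customer_id in raw_events:
    row = rollup[(hour, country)]
    row["count"] += 1
    row["uniques"].add(customer_id)  # Druid stores an HLL sketch, not a set

print(len(raw_events), "raw rows ->", len(rollup), "stored rows")
```

Here 5 raw events collapse into 3 stored rows; with low-cardinality dimensions and billions of events, the reduction is far more dramatic.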

You mentioned range queries. Range filters on metrics work great, but if you mean querying by numerical dimensions, that is still a work in progress in Druid.

As for the distinct count, both ES and Druid support HyperLogLog. In Druid, you have to specify the fields at ingestion time in order to apply HyperLogLog at query time. It's pretty fast and efficient.
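For illustration, roughly what that looks like in Druid's JSON specs (the datasource and field names here are made up; check the Druid docs for the exact spec for your version). At ingestion time, `metricsSpec` declares the HLL column:

```json
{
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "hyperUnique", "name": "customer_id_hll", "fieldName": "customer_id" }
  ]
}
```

At query time, you aggregate over the pre-built sketch:

```json
{
  "queryType": "timeseries",
  "dataSource": "my_table",
  "granularity": "all",
  "intervals": ["2016-01-01/2017-01-01"],
  "aggregations": [
    { "type": "hyperUnique", "name": "distinct_customers", "fieldName": "customer_id_hll" }
  ]
}
```

The query merges the per-row sketches instead of scanning raw customer IDs, which is why it stays fast and memory-bounded.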

Kenji Noguchi

Recent versions of Elasticsearch (6.x, AFAIK) support your use case, and you will get a result from all three (Druid, ES, Solr). But to answer your last question about performance: I expect Druid will be the most performant, with the smallest resource requirements, for your use case.

Though ES supports analytics and aggregations, its primary design is driven by free-text search. Because ES does more than what your requirement above calls for, it will use more resources and may not be the right fit unless you want to do more than just the distinct count.

Quoting from Druid's website https://druid.apache.org/docs/latest/comparisons/druid-vs-elasticsearch.html

Druid focuses on OLAP workflows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost and supports a wide range of analytic operations.

avp