Calculating statistical mode on Apache druid

Question

We are using Apache Druid to do time series based analytics on a dataset and are stuck trying to calculate mode for one of the analytics.

Statistical mode tells us about the data point that is most frequently repeated in the dataset.

Example: Given a dataset like below, try to Calculate the most representative sale bucket for each product

Dataset:

Product ID  Month   Sale Bucket
Pid1    January 10-20
Pid1    February    0-10
Pid1    March   10-20
Pid2    January 0-10
Pid2    February    10-20
Pid2    March   0-10

Expected output:

Product ID  Sale Bucket
Pid1    10-20
Pid2    0-10

Tried a simple google search to find if there was something in-built or an extension to do the same. Couldn't find one.

Then I started trying my own naive approach to perform the mode over this dataset using below steps:

Step 1. Getting the relevant frequencies of the data point (Group By over Product ID and Sale bucket) Resulting in

Product ID  Sale Bucket Frequency
Pid1    10-20   2
Pid1    0-10    1
Pid2    10-20   1
Pid2    0-10    2

Step 2. Getting the relevant data point with maximum Frequency per Product (out of the Frequencies calculated in 1). I am stuck at this point 2, having tried 2 approaches:

Concatenate the column to be chosen as a string and then remove it later (Explain here on stackoverflow: SQL Query to get column values that correspond with MAX value of another column?. But druid doesn't seem to support a MAX/MIN operation on Strings. :-/
Trying a rank based approach to rank the items by frequency and then pick the highest ranks in a further select query. Druid doesn't support Rank operator. :-/

Does anyone have any ideas on how to proceed further?

score 0 · Answer 1 · answered Mar 29 '20 at 12:44

The normal way to do this in a databases uses window functions, which are not supported in Druid. Something like the following may work:

with ps as (
      select product, salebucket, count(*) as cnt
      from t
      group by product, salebucket
     )
select ps.product, ps.salebucket
from ps join
     (select ps.product, max(cnt) as max_cnt
      from ps
      group by ps.product
     ) ps2
     on ps2.product = ps.product and ps2.max_cnt = ps.cnt;

I'm not 100% sure this meets the constraints of Druid SQL, but it might work for you.

This is something that I might have used in Sql databases and thought of doing this in Druid as well. The problem is Druid doesn't support joins as well. https://druid.apache.org/docs/latest/querying/joins.html — Jitender Kumar, Mar 29 '20 at 13:54

Calculating statistical mode on Apache druid

1 Answers1