We are using Apache Druid to do time series based analytics on a dataset and are stuck trying to calculate mode for one of the analytics.
Statistical mode tells us about the data point that is most frequently repeated in the dataset.
Example: Given a dataset like below, try to Calculate the most representative sale bucket for each product
Dataset:
Product ID Month Sale Bucket
Pid1 January 10-20
Pid1 February 0-10
Pid1 March 10-20
Pid2 January 0-10
Pid2 February 10-20
Pid2 March 0-10
Expected output:
Product ID Sale Bucket
Pid1 10-20
Pid2 0-10
Tried a simple google search to find if there was something in-built or an extension to do the same. Couldn't find one.
Then I started trying my own naive approach to perform the mode over this dataset using below steps:
Step 1. Getting the relevant frequencies of the data point (Group By over Product ID and Sale bucket) Resulting in
Product ID Sale Bucket Frequency
Pid1 10-20 2
Pid1 0-10 1
Pid2 10-20 1
Pid2 0-10 2
Step 2. Getting the relevant data point with maximum Frequency per Product (out of the Frequencies calculated in 1). I am stuck at this point 2, having tried 2 approaches:
- Concatenate the column to be chosen as a string and then remove it later (Explain here on stackoverflow: SQL Query to get column values that correspond with MAX value of another column?. But druid doesn't seem to support a MAX/MIN operation on Strings. :-/
- Trying a rank based approach to rank the items by frequency and then pick the highest ranks in a further select query. Druid doesn't support Rank operator. :-/
Does anyone have any ideas on how to proceed further?