I use Druid for monitoring events in my website. The data can be represented as follows:
event_id | country | user_id | event_type
================================================
1 | USA | id1 | visit
2 | USA | id2 | visit
1 | Canada | id3 | visit
3 | USA | id1 | click
1 | Canada | id4 | visit
3 | Canada | id3 | click
3 | USA | id2 | click
I also defined an aggregation for counting events. I made queries to Druid to present data for event_id=3 as follows:
Pay attention that the visits are not related to the event_id.
country | visits | clicks
===============================
USA | 4 | 2
Canada | 3 | 2
Currently I use two queries of topNResults with 2 different filters:
- event_type = visit -> to count visits per country regardless of the event id.
- event_id = 3
Of course my data is much larger than that and contains many countries. topNResults api must have threshold param that represents the max amount of results that I want to get as a response.
The problem is if my threshold is smaller than actual results, those two queries might not have the same countries results.
Currently, I merge the overlapping results in my server but I loose some countries results and I display less than my threshold although there are more results.
What can I do to optimize that I will always have the same countries for my threshold (without sending list of countries returned from the first query to the second filter - I tried it and it was very slow)?