I would like to get statistics on my dataset,
for example this is my dataset:
FirstName LastName Country City BirthMonth
Donald Trump England London Jan
Bill Gates England London Sep
Donald Suther England York Sep
Donald Suther Germany Berlin Jan
and this is my 'group_by' list:
[['FirstName', 'LastName'], ['Country', 'City'], ['BirthMonth']]
I would like to get the following statistics:
group by FirstName & LastName:
FirstName LastName Count
Donald Trump 1
Donald Suther 2
Bill Gates 1
group by Country & City:
Country City Count
England London 2
England York 1
Germany Berlin 1
group by BirthMonth:
BirthMonth Count
Jan 2
Sep 2
my query would look like this:
select FirstName, LastName, Country, City, BirthMonth
from my_table
where <some conditions to filter rows only from certain timestamp>
now I have two options:
a. return all values to server and process there (I'm using python) - this query includes many rows and causes overhead
b. query multiple times, each time grouping by the specific fields
select FirstName, LastName, count(*) as group_by
from my_table
where ...
group by FirstName, LastName
c. is there a third option of having one query, and return all different 'group_by's?
and another question:
from (a) and (b) which is better?
I'll note that the group by has limited options - for example 'FirstName' has only 20 options, this means that (b) will result each time less than 20*(num of group by in query rows) < 40, but (a) will result 20^5 rows which is tons of data
to simplify I'll assume there aren't more than 2 columns in a group_by
each time,
(which is actually the current situation, it might grow in the future but currently I can use a solution that takes that into account)