-1

I would like to get statistics on my dataset,

for example this is my dataset:

FirstName LastName Country  City    BirthMonth

Donald    Trump    England  London  Jan
Bill      Gates    England  London  Sep
Donald    Suther   England  York    Sep
Donald    Suther   Germany  Berlin  Jan

and this is my 'group_by' list:

[['FirstName', 'LastName'], ['Country', 'City'], ['BirthMonth']]

I would like to get the following statistics:

group by FirstName & LastName:

FirstName LastName Count

Donald    Trump    1    
Donald    Suther   2    
Bill      Gates    1

group by Country & City:

Country City    Count

England London  2
England York    1
Germany Berlin  1

group by BirthMonth:

BirthMonth Count 

Jan        2    
Sep        2      

my query would look like this:

select FirstName, LastName, Country, City, BirthMonth
from my_table
where <some conditions to filter rows only from certain timestamp>

now I have two options:

a. return all values to server and process there (I'm using python) - this query includes many rows and causes overhead

b. query multiple times, each time grouping by the specific fields

select FirstName, LastName, count(*) as group_by
from my_table
where ...
group by FirstName, LastName

c. is there a third option of having one query, and return all different 'group_by's?

and another question:
from (a) and (b) which is better?
I'll note that the group by has limited options - for example 'FirstName' has only 20 options, this means that (b) will result each time less than 20*(num of group by in query rows) < 40, but (a) will result 20^5 rows which is tons of data

to simplify I'll assume there aren't more than 2 columns in a group_by each time, (which is actually the current situation, it might grow in the future but currently I can use a solution that takes that into account)

dina
  • 4,039
  • 6
  • 39
  • 67
  • How can you expect to do this in a single query? You have different numbers of columns in some rows. – Barmar Aug 22 '21 at 06:35
  • 1
    Did you mean to say that `(a)` will return 20^5 rows? – Barmar Aug 22 '21 at 06:38
  • 2
    A general rule of using databases is to try to do as much of the processing as possible in the database, to minimize the amount of results that has to be sent to the client. So (b) is usually the preferred way. – Barmar Aug 22 '21 at 06:39
  • But for any specific application, you should benchmark the different ways to see which is best for you. Maybe fetch the data into pandas and do the grouping there (if you're using Python). – Barmar Aug 22 '21 at 06:41
  • I added that my group by won't have more than 2 columns , isn't there a way to have a result that contains 3 columns `(group_by_col1, group_by_col2, cnt)` and if has only one col, `group_by_col2` will be null? – dina Aug 22 '21 at 06:42
  • So even if you have the same number of columns, the column names will be wrong if you combine them all with `UNION`. – Barmar Aug 22 '21 at 06:43
  • if I would have only one column in each group by there would be a way to do that? if so, how? – dina Aug 22 '21 at 06:43
  • @Barmar I am using python and can use pandas, but from my experience loading from the db large data causes overhead in the db, are you sure it is better to load all the data, than using three queries, when each one results a small amount of data? – dina Aug 22 '21 at 06:45
  • @Barmar. Wouldn't be doing a lot of processing in the database make it impossible to scale horizontally in the future? – Soheil Rahsaz Aug 22 '21 at 07:51
  • @SoheilRahsaz Possibly, although the overhead of returning all the data to the client machines so they can then distribute it to all the nodes may negate the benefits of horizontal scaling. – Barmar Aug 22 '21 at 07:54
  • 1
    The basic point is that communication is usually orders of magnitude slower than processing, so you usually need to minimize it. – Barmar Aug 22 '21 at 07:56

1 Answers1

1

You can combine each grouped query with the others using UNION. Fill in the unused columns in each subquery with NULL.

SELECT FirstName, LastName, NULL AS Country, NULL AS City, NULL AS BirthMonth, COUNT(*) AS count
FROM my_table
GROUP BY FirstName, LastName

UNION ALL

SELECT NULL, NULL, Country, City, NULL, COUNT(*)
FROM my_table
GROUP BY Country, City

UNION ALL

SELECT NULL, NULL, NULL, NULL, BirthMonth, COUNT(*)
FROM my_table
GROUP BY BirthMonth
Barmar
  • 741,623
  • 53
  • 500
  • 612
  • so I'm doing here three queries and uniting, I guess a better way is to use a sub named query – dina Aug 22 '21 at 06:51
  • As I said in a comment, they only real way to tell what's best is with a benchmark. – Barmar Aug 22 '21 at 06:53
  • 1
    I doubt there will be much difference between this and doing 3 separate queries. And processing the results will be easier, because you won't have to figure out which subgroup each row is in. – Barmar Aug 22 '21 at 06:54