
I am new to Python. Currently, I have a dataset with more than 40 columns that need to be grouped and aggregated. I was wondering if Python has a function similar to the CUBE() function in SQL. If not, how can I produce the expected output? I'd really appreciate it if someone could answer this for me. Thanks!

Below is an example that I simplified to only two grouping columns (Country & Attribute A):

| CustomerID | Country | Attribute A | Sales | No. of product | No. of transaction |
| ---------- | ------- | ----------- | ----- | -------------- | ------------------ |
| 1          | US      | A           | 20    | 2              | 2                  |
| 2          | US      | B           | 25    | 3              | 3                  |
| 3          | CA      | A           | 100   | 10             | 10                 |
| 4          | CA      | B           | 50    | 5              | 5                  |
| 5          | UK      | A           | 40    | 4              | 4                  |

Expected Output:

| Country | Attribute A | Sum of Sales | Total no. of product | Total no. of transaction | Total no. of customer |
| ------- | ----------- | ------------ | -------------------- | ------------------------ | --------------------- |
| US      | (null)      | 45           | 5                    | 5                        | 2                     |
| CA      | (null)      | 150          | 15                   | 15                       | 2                     |
| UK      | (null)      | 40           | 4                    | 4                        | 1                     |
| (null)  | A           | 160          | 16                   | 16                       | 3                     |
| (null)  | B           | 75           | 8                    | 8                        | 2                     |
| US      | A           | 20           | 2                    | 2                        | 1                     |
| US      | B           | 25           | 3                    | 3                        | 1                     |
| CA      | A           | 100          | 10                   | 10                       | 1                     |
| CA      | B           | 50           | 5                    | 5                        | 1                     |
| UK      | A           | 40           | 4                    | 4                        | 1                     |
  • Could you describe what exactly the cube function in SQL does? – Prats Feb 02 '22 at 14:38
  • @Prats The "expected output" gives a sense of what's going on. It's a "groupby" operation over multiple columns whereby the result (in this case the sum) is shown for every *subset* of values for the two columns of interest. The `US ... (null)` row, for instance, gives sums over all entries with Country=US. It looks like the expected output is missing a `(null) (null)` row, which would correspond to the overall total. – Ben Grossmann Feb 02 '22 at 15:00
  • @BenGrossmann Yes, sorry for missing that combination in the output. Do you have any idea how Python could achieve this expected output with a simple function? I saw the comments; it seems it would be complicated to do in Python because my real dataset has many more columns than the one in this example. – aaah_ooooh9305 Feb 03 '22 at 03:24
  • I don't know of any such function. I've come up with a way to make this work in general, though. – Ben Grossmann Feb 03 '22 at 04:37

3 Answers


Here's an extension of the other two answers that generalizes to arbitrarily many columns.

import pandas as pd
from itertools import combinations

def cube_sum(df, cols):
    """Sum the numeric columns of df grouped by every non-empty subset
    of cols, then add the overall total, as SQL's CUBE would."""
    dfs = []
    for n in range(len(cols), 0, -1):
        for subset in combinations(cols, n):
            # numeric_only=True excludes any grouping column that is not
            # part of the current subset (pandas >= 2.0 no longer drops
            # non-numeric columns automatically)
            dfs.append(df.groupby(list(subset)).sum(numeric_only=True).reset_index())
    # the empty subset: the grand total over all rows
    dfs.append(df.drop(cols, axis=1).sum().to_frame().T)
    return pd.concat(dfs)


df = pd.DataFrame(
    {'CustomerID': [1, 2, 3, 4, 5],
     'Country': ['US', 'US', 'CA', 'CA', 'UK'],
     'Attribute A': list("ABABA"),
     'Sales': [20, 50, 100, 50, 40],
     'No. of Product': [2, 3, 10, 5, 4],
     'No. of Transaction': [2, 3, 10, 5, 4]}).set_index('CustomerID')
all_groups = cube_sum(df, ["Attribute A", "Country"])

The result:

  Attribute A Country  Sales  No. of Product  No. of Transaction
0           A      CA    100              10                  10
1           A      UK     40               4                   4
2           A      US     20               2                   2
3           B      CA     50               5                   5
4           B      US     50               3                   3
0           A     NaN    160              16                  16
1           B     NaN    100               8                   8
0         NaN      CA    150              15                  15
1         NaN      UK     40               4                   4
2         NaN      US     70               5                   5
0         NaN     NaN    260              24                  24
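
To match the question's formatting, the NaN placeholders in the two grouping columns can be replaced with the literal string (null), and the duplicated row index renumbered. A minimal sketch (fillna is applied only to the grouping columns, so numeric values are untouched):

# show "(null)" instead of NaN in the grouping columns and renumber
# the rows, to match the expected output in the question
all_groups[["Attribute A", "Country"]] = (
    all_groups[["Attribute A", "Country"]].fillna("(null)"))
all_groups = all_groups.reset_index(drop=True)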

In response to the comment:

import pandas as pd
from itertools import combinations

def cube_agg(df, cols):
    dfs = []
    for n in range(len(cols), 0, -1):
        for subset in combinations(cols, n):
            dfs.append(df.groupby(list(subset))
                 .aggregate({'Population': 'mean', 'Area': ['min', 'max']})
                 .reset_index())
    # For the overall aggregate, transpose and stack so that the one-row
    # frame gets the same two-level columns as the grouped frames above.
    # (Dropping `cols` instead of hard-coded column names keeps the
    # function reusable with other grouping columns.)
    dfs.append((df.drop(cols, axis=1)
                 .aggregate({'Population': 'mean', 'Area': ['min', 'max']})
                 .T.stack()
                 .to_frame().T))
    return pd.concat(dfs)

df = pd.DataFrame(
    {'CustomerID': [1, 2, 3, 4, 5],
     'Country': ['US', 'US', 'CA', 'CA', 'UK'],
     'Attribute A': list("ABABA"),
     'Population': [20, 50, 100, 50, 40],
     'Area': [2, 3, 10, 5, 4]}).set_index('CustomerID')
all_groups = cube_agg(df, ["Attribute A", "Country"])

Result:

   Area       Attribute A Country  Population
    max   min                            mean
0  10.0  10.0           A      CA  100.000000
1   4.0   4.0           A      UK   40.000000
2   2.0   2.0           A      US   20.000000
3   5.0   5.0           B      CA   50.000000
4   3.0   3.0           B      US   50.000000
0  10.0   2.0           A     NaN   53.333333
1   5.0   3.0           B     NaN   50.000000
0  10.0   5.0         NaN      CA   75.000000
1   4.0   4.0         NaN      UK   40.000000
2   3.0   2.0         NaN      US   35.000000
0  10.0   2.0         NaN     NaN   52.000000

The trickiest aspect of this is getting the bottom row (the overall aggregate) to come out as expected. Note that the columns are arranged in alphabetical order, hence Area coming first.
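
If the two-level column index is inconvenient downstream, it can be flattened to plain strings. A sketch (the underscore separator is an arbitrary choice, not anything pandas prescribes):

# join each (column, aggregation) pair into one name, e.g.
# ("Area", "max") -> "Area_max"; columns with an empty second level,
# like ("Country", ""), keep just their own name
all_groups.columns = [
    "_".join(str(part) for part in col if part)
    for col in all_groups.columns
]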

Ben Grossmann
  • Hi @Ben Grossmann, is there any way to calculate the mean for **Population** and the **min** and **max** for **Area**? `import pandas as pd from itertools import combinations def cube_sum(df,cols): dfs = [] for n in range(len(cols),0,-1): for subset in combinations(cols,n): (df.groupby(list(subset)).aggregate({'population': 'mean', 'area': ['min', 'max']}).reset_index()) dfs.append(df.drop(cols, axis=1).aggregate({'population': 'mean', 'area': ['min', 'max']}).to_frame().T) return pd.concat(dfs)` – Ashutosh SIngh Jul 28 '23 at 10:46
  • @AshutoshSIngh See my latest edit. The final row of the result needs to be handled a bit differently when there are multiple aggregation functions. – Ben Grossmann Jul 29 '23 at 18:40

I have updated my answer to use reset_index(), based on Acccumulation's answer. I also added sort=True to avoid the warning FutureWarning: Sorting because non-concatenation axis is not aligned (credit).

import pandas as pd

data = {
    "Country": ["US", "US", "CA", "CA", "UK"],
    "Attribute A": ["A", "B", "A", "B", "A"],
    "Sales": [20, 25, 100, 50, 40],
    "No.of product": [2, 3, 10, 5, 4],
    "No. of transaction": [2, 3, 10, 5, 4],
}
df = pd.DataFrame(data)
# numeric_only=True excludes the leftover string grouping column
# (pandas >= 2.0 no longer drops non-numeric columns automatically)
country_null = df.groupby(["Attribute A"]).sum(numeric_only=True).reset_index()
attr_a_null = df.groupby(["Country"]).sum(numeric_only=True).reset_index()
both_groups = df.groupby(["Country", "Attribute A"]).sum(numeric_only=True).reset_index()
# DataFrame.append was removed in pandas 2.0; pd.concat with sort=True
# does the same job and also avoids the FutureWarning mentioned above
all_groups = pd.concat([both_groups, country_null, attr_a_null], sort=True)

print(all_groups)

Output:
Attribute A Country  No. of transaction  No.of product  Sales
0           A      CA                  10             10    100
1           B      CA                   5              5     50
2           A      UK                   4              4     40
3           A      US                   2              2     20
4           B      US                   3              3     25
0           A     NaN                  16             16    160
1           B     NaN                   8              8     75
0         NaN      CA                  15             15    150
1         NaN      UK                   4              4     40
2         NaN      US                   5              5     45
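
As noted in the comments on the question, a full cube would also include the overall total (the row with both grouping columns null). Following the same pattern, a sketch of how to append it:

# sum the numeric columns over the whole frame and turn the resulting
# Series into a one-row DataFrame; after concat, the two grouping
# columns are NaN for that row
grand_total = df.drop(columns=["Country", "Attribute A"]).sum().to_frame().T
all_groups = pd.concat([all_groups, grand_total], sort=True)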
A D
  • Thanks. So there is no simple function to achieve the expected output? Actually, in my case I have over 40 columns that need to be grouped, so the output would be much more complicated. Say I added one more column, Attribute B, to the sample dataset; there would be more combinations in the output. If I follow the code you suggested, do I need to consider all possible combinations and write them into df.groupby one by one? – aaah_ooooh9305 Feb 03 '22 at 03:21

There are some answers here: Pandas Pivot tables row subtotals

A D's answer can be modified to get output that more closely matches what you asked for:

country_null = df.groupby(["Attribute A"]).sum(numeric_only=True).reset_index()
attr_a_null = df.groupby(["Country"]).sum(numeric_only=True).reset_index()
both_groups = df.groupby(["Country", "Attribute A"]).sum(numeric_only=True).reset_index()
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
all_groups = pd.concat([both_groups, country_null, attr_a_null])
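
For comparison, the pivot-table route from the linked question covers part of this: margins=True adds an "All" row and column of totals, but only for one aggregation at a time. A sketch (not a full CUBE replacement):

# row totals per Country, column totals per Attribute A, plus the
# grand total, all labelled "All" by default
pivot = pd.pivot_table(df, index="Country", columns="Attribute A",
                       values="Sales", aggfunc="sum", margins=True)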
Acccumulation