1

What I'm trying to do is replicate this SQL code to Python:

select column_1, column_2, column_3, 
       sum(column_4) as sum_column_4, sum(column_5) as sum_column_5
from df 
group by 1,2,3;

In other words, I need to make this data frame:

column_1   column_2   colunn_3   column_4   column_5
AA         BB         CC         5          3
AA         BB         CC         5          0
AA         BB         CC         7          3
AA         DD         EE         5          2
AA         DD         EE         7          1
DD         EE         FF         2          8
DD         EE         FF         1          0

Look like this:

column_1   column_2   colunn_3   sum_column_4   sum_column_5
AA         BB         CC         17             6
AA         DD         EE         12             3
DD         EE         FF         3              8

Also, I'm trying to make this as simple as possible, because I actually have a lot of columns. And I need to have a new Pandas data frame as output. So this is what I've tried:

df.groupby(list(df.columns)[0:3]).sum()

It's almost there, the problem is that the output gets weird, something like:

column_1   column_2   colunn_3   sum_column_4   sum_column_5
AA         BB         CC         17             6
           DD         EE         12             3
DD         EE         FF         3              8

I'm trying different things that I've seen in other posts, like Pandas DataFrame Groupby two columns and get counts and Python Pandas group by multiple columns, mean of another - no group by object, but that didn't work. So if anyone could help me.

Dumb ML
  • 357
  • 2
  • 12
  • `df.groupby(['column_1','column_2','column_3']).sum()`. – Quang Hoang Apr 29 '20 at 20:31
  • @QuangHoang the problem with this is that the output is a little weird. For instance, if I have AA + BB + CC in one line and AA + DD + EE in another, the column 1 will not be filled with "AA" in the second row, just for the first. That's why I used SQL as example of what I'm replicating – Dumb ML Apr 29 '20 at 20:33
  • @QuangHoang I've updated the question with the problem – Dumb ML Apr 29 '20 at 20:35
  • You see the missing label because they are index. chain that with `reset_index()` – Quang Hoang Apr 29 '20 at 20:35
  • @QuangHoang it worked! Many thanks! – Dumb ML Apr 29 '20 at 20:38

1 Answers1

0

Like @Quang mentioned in comments, you need to reset indexing:

df.groupby(list(df.columns)[0:3]).sum().reset_index()

When you groupby multiple columns at once, you create a hierarchical multi-indexing and that is why you see column_1 groups index AA.

output:

  column_1 column_2 colunn_3  column_4  column_5
0       AA       BB       CC        17         6
1       AA       DD       EE        12         3
2       DD       EE       FF         3         8
Ehsan
  • 12,072
  • 2
  • 20
  • 33