1

Basically, I want to use iterrows method to loop through my group-by dataframe, but I can't figure out how the columns work. In the example below, it does not create a column Called "Group1" and "Group2" like one might expect. One of the columns is a dtype itself?

import pandas as pd

df = pd.DataFrame(columns=["Group1", "Group2", "Amount"])

df = df.append({"Group1": "Apple", "Group2": "Red Delicious", "Amount": 15}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "McIntosh", "Amount": 20}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "McIntosh", "Amount": 30}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "Fuju", "Amount": 7}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Navel", "Amount": 9}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Navel", "Amount": 5}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Mandarin", "Amount": 12}, ignore_index=True)
print(df.dtypes)
print(df.to_string())

df_sum = df.groupby(['Group1', 'Group2']).sum(['Amount'])
print("---- Sum Results----")
print(df_sum.dtypes)
print(df_sum.to_string())

for index, row in df_sum.iterrows():
    # The line below is what I want to do conceptually. 
    # print(row.Group1, row.Group2. row.Amount) # 'Series' object has no attribute 'Group1'
    print(row.Amount)  # 'Series' object has no attribute 'Group1'

The part of the output we are interested is here. I noticed that "Group1 and Group2" are on a lin below the Amount.

---- Sum Results----
Amount    int64
dtype: object
                      Amount
Group1 Group2               
Apple  Fuju                7
       McIntosh           50
       Red Delicious      15
Orange Mandarin           12
       Navel              14
NealWalters
  • 17,197
  • 42
  • 141
  • 251
  • They are the converted to an `index` by default. They are no longer columns in the output. Behaviour can be avoided by passing `as_index=False` to `groupby`. -> `df_sum = df.groupby(['Group1', 'Group2'], as_index=False)['Amount'].sum()` – Henry Ecker Jun 25 '21 at 16:01
  • `iterrows` is generally discouraged, but to answer your question your groupby object returns a new dataframe where `Group1 + 2` are now the indices of said dataframe. What are you trying to do ? – Umar.H Jun 25 '21 at 16:02
  • I want to loop through the groupby and then do logic, which might include lookups or matching on the values of Group1 and Group2. – NealWalters Jun 25 '21 at 16:03
  • 2
    df_sum = df.groupby(['Group1', 'Group2']).sum(['Amount']).reset_index() – BENY Jun 25 '21 at 16:04
  • @BENY - That's exactly what I needed, you want to put as answer? I'm basically just using dataframe almost like SQL, but reading in various CSV files. – NealWalters Jun 25 '21 at 16:06
  • 2
    By the way, `df.groupby(["Group1", "Group2"]).sum(["Amount"])` isn't really doing what you meant; that should read `df.groupby(["Group1", "Group2"])["Amount"].sum()`, i.e., choosing the column after groupby and them sum. The former also happened to work because `GroupBy.sum`'s first argument is `numeric_only` parameter which evaluates what's passed as a boolean and a non-empty list is `True` for that. Its default is also `True` so it is as if you didn't pass anything to `sum`. Since the only numeric column left after groups is `Amount` you get the meant output incidentally. – Mustafa Aydın Jun 25 '21 at 16:07
  • Does this answer your question? [Converting a Pandas GroupBy output from Series to DataFrame](https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-output-from-series-to-dataframe) – Henry Ecker Jun 25 '21 at 16:07
  • 1
    Specifically [this answer](https://stackoverflow.com/a/32307259/15497888) – Henry Ecker Jun 25 '21 at 16:08
  • Oh Wow, that's big news. I thought groupby was returning a Dataframe, but it's returning a series which I have not yet really studied or played with. – NealWalters Jun 25 '21 at 16:10
  • 1
    @NealWalters Actually it's returning neither... Please see [here](https://pandas.pydata.org/docs/user_guide/groupby.html) or [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html#GroupBy:-Split,-Apply,-Combine). – Mustafa Aydın Jun 25 '21 at 16:16
  • I would recommend looking at some of the core pandas functions, using vanilla loops is an anti pattern in pandas, think of it like trying to use Excel to do SQL, pandas has joins that you can use for looksups (the syntax is `pd.merge`) – Umar.H Jun 25 '21 at 16:23
  • @NealWalters, Even i see you can simply use `df_sum = df.groupby(['Group1', 'Group2']).sum().reset_index()` this will produce the same desired outcome. – Karn Kumar Jun 25 '21 at 16:25

2 Answers2

2

Simply try:

df_sum = df.groupby(['Group1', 'Group2'])['Amount'].sum().reset_index()

OR

df_sum = df.groupby(['Group1', 'Group2'])['Amount'].agg('sum').reset_index()

Even, it Simply can be ad follows, as we are performing the sum based on the Group1 & Group2 only.

df_sum = df.groupby(['Group1', 'Group2']).sum().reset_index()

Another way:

df_sum = df.groupby(['Group1', 'Group2']).agg({'Amount': 'sum'}).reset_index()
Karn Kumar
  • 8,518
  • 3
  • 27
  • 53
  • 1
    The last one is nice way providing `dict` method for columns with summation. – user2023 Jun 25 '21 at 16:45
  • @kulfi, yes it provided different parameters like `function, str, list or dict` , more details can be taken from pandas org document. – Karn Kumar Jun 25 '21 at 16:47
1

Try to reset_index

df_sum = df.groupby(['Group1', 'Group2']).sum(['Amount']).reset_index()
BENY
  • 317,841
  • 20
  • 164
  • 234