Python Pandas: Groupby Sum AND Concatenate Strings

Question

Sample Pandas Dataframe:

ID Name COMMENT1 COMMENT2 NUM
1  dan  hi       hello    1
1  dan  you      friend   2
3  jon  yeah     nope     3
2  jon  dog      cat      .5
3  jon  yes      no       .1

I am trying to create a dataframe that groups by ID and Name, concatenates COMMENT1 and COMMENT2 within each group, and also sums NUM.

This is what I'm looking for:

ID Name COMMENT1     COMMENT2        NUM
1  dan  hi you       hello friend    3
3  jon  yeah yes     nope no         3.1
2  jon  dog          cat             .5

I tried using this:

input_df = input_df.groupby(['ID', 'NAME', 'COMMENT1', 'COMMENT2']).sum().reset_index()

But it doesn't work.

If I use this:

input_df = input_df.groupby(['ID']).sum().reset_index()

It sums the NUM column but leaves out all other columns.

Possible duplicate of [Pandas groupby: How to get a union of strings](https://stackoverflow.com/questions/17841149/pandas-groupby-how-to-get-a-union-of-strings) - the accepted answer there shows how to use a lambda to get what you want — Patrick Artner, Dec 01 '17 at 20:15

score 18 · Accepted Answer · answered Dec 01 '17 at 20:21

This can be done in one line:

df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))
Out[1510]: 
   ID Name  COMMENT1      COMMENT2  NUM
0   1  dan    hi you  hello friend  3.0
1   2  jon       dog           cat  0.5
2   3  jon  yeah yes       nope no  3.1
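
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question. The check `x.dtype=='float64'` only matches float columns, so this version swaps in `pandas.api.types.is_numeric_dtype`; that substitution is mine and not part of the answer above, and the names `df` and `out` are only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 1, 3, 2, 3],
    "Name": ["dan", "dan", "jon", "jon", "jon"],
    "COMMENT1": ["hi", "you", "yeah", "dog", "yes"],
    "COMMENT2": ["hello", "friend", "nope", "cat", "no"],
    "NUM": [1, 2, 3, 0.5, 0.1],
})

# Sum numeric columns, join everything else with a space
out = df.groupby(["ID", "Name"], as_index=False).agg(
    lambda x: x.sum() if is_numeric_dtype(x) else " ".join(x)
)
print(out)
```

With a numeric-dtype check, an integer NUM column would be summed as well, rather than being passed to `' '.join`.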

BENY

if there's a NaN in the group this doesn't work, correct? – Yuca Aug 13 '18 at 14:50
@Yuca you mean the group key ? – BENY Aug 13 '18 at 14:52
if instead of 'cat' there was NaN, then it looks like the code wouldn't work, no? – Yuca Aug 13 '18 at 14:53
@Yuca you can replace the NaN with the string 'NaN' and adjust it afterwards – BENY Aug 13 '18 at 14:54
@WeNYoBen, thank you. Does this preserve the order of the strings in the pandas dataframe column that is being concatenated? – bernando_vialli Jun 12 '19 at 18:53
@mkheifetz you can always add `.reindex(columns=df.columns)` to make sure the order is the same as before – BENY Jun 12 '19 at 19:10
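
On the NaN question raised in the comments: `' '.join` raises a `TypeError` when it meets a float NaN. One possible workaround, sketched below, is to drop missing values before joining. The helper name `join_non_null` and the shortened sample frame are hypothetical and not from the thread.

```python
import numpy as np
import pandas as pd

# Shortened sample with a missing comment in place of 'cat'
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Name": ["dan", "dan", "jon"],
    "COMMENT2": ["hello", "friend", np.nan],
    "NUM": [1.0, 2.0, 0.5],
})

def join_non_null(values):
    # Hypothetical helper: skip missing entries, then join what is left
    return " ".join(values.dropna().astype(str))

out = df.groupby(["ID", "Name"], as_index=False).agg(
    {"COMMENT2": join_non_null, "NUM": "sum"}
)
print(out)
```

A group whose comments are all missing ends up with an empty string here. If a visible placeholder is preferred, something like `fillna('NaN')` before grouping, as suggested in the comment above, would be the alternative.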

score 4 · Answer 2 · answered Jul 18 '19 at 23:55

You can also tell .agg() which aggregation function to use for each column. For the string columns, pass ' '.join (note that there are no parentheses, since you want to pass .join itself as the argument rather than call it):

df.groupby(['ID','Name'],as_index=False).agg({'COMMENT1': ' '.join, 'COMMENT2': ' '.join, 'NUM': 'sum'})
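
To make the point about passing `' '.join` without calling it concrete, here is a small sketch on the question's sample data. The name `joiner` is only for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 1, 3, 2, 3],
    "Name": ["dan", "dan", "jon", "jon", "jon"],
    "COMMENT1": ["hi", "you", "yeah", "dog", "yes"],
    "COMMENT2": ["hello", "friend", "nope", "cat", "no"],
    "NUM": [1, 2, 3, 0.5, 0.1],
})

# ' '.join is a bound method: a callable that takes an iterable of strings.
# It is stored here without being called.
joiner = " ".join
print(joiner(["hi", "you"]))

# pandas calls joiner once per group for each string column
out = df.groupby(["ID", "Name"], as_index=False).agg(
    {"COMMENT1": joiner, "COMMENT2": joiner, "NUM": "sum"}
)
print(out)
```

Compared with the lambda approach, the dict form needs every column listed, but it makes the choice of aggregator per column explicit.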

score -1 · Answer 3 · answered Dec 01 '17 at 21:11

Converting your data example into a csv file, we can do the following:

import pandas as pd

def grouping_Cols_by_Cols(DF, grouping_Columns, num_Columns):
    # numerical columns can mess us up ...
    column_Names = DF.columns.tolist()
    # so, convert all columns' values to strings
    for column_Name in column_Names:
        DF[column_Name] = DF[column_Name].map(str) + ' '
    DF = DF.groupby(by=grouping_Columns).sum()

    # NOW, convert the numerical string columns back to numbers ...
    for num_Col in num_Columns:
        # each cell looks like '1 2 ': split on whitespace and add up the parts
        # (this avoids calling eval on the data)
        DF[num_Col] = DF[num_Col].map(lambda s: sum(float(v) for v in s.split()))

    return DF

###############################################################
### Operations Section
###############################################################

df = pd.read_csv("UnCombinedData.csv")

grouping_Columns = ['ID','Name']
num_Columns = ['NUM']
df = grouping_Cols_by_Cols(df,grouping_Columns, num_Columns)

print(df)

With a little more work, the function could auto-detect which columns contain numbers and add them to the list of numerical columns.

I think this is similar, though not identical, to the problems and challenges encountered in this post.
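
The auto-detection idea mentioned above might look roughly like the sketch below. It uses `select_dtypes` to find the numeric columns and builds an aggregation dict from them instead of converting everything to strings. The helper name `build_agg_spec` is hypothetical, and this is one possible approach rather than a completion of the function above.

```python
import pandas as pd

def build_agg_spec(frame, grouping_columns):
    # Hypothetical helper: choose an aggregator for each non-grouping column
    numeric = set(frame.select_dtypes(include="number").columns)
    spec = {}
    for col in frame.columns:
        if col in grouping_columns:
            continue
        # Sum the numeric columns, join everything else with a space
        spec[col] = "sum" if col in numeric else " ".join
    return spec

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 1, 3, 2, 3],
    "Name": ["dan", "dan", "jon", "jon", "jon"],
    "COMMENT1": ["hi", "you", "yeah", "dog", "yes"],
    "COMMENT2": ["hello", "friend", "nope", "cat", "no"],
    "NUM": [1, 2, 3, 0.5, 0.1],
})

grouping_columns = ["ID", "Name"]
spec = build_agg_spec(df, grouping_columns)
out = df.groupby(grouping_columns, as_index=False).agg(spec)
print(out)
```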
