11

Sample Pandas Dataframe:

ID Name COMMENT1 COMMENT2 NUM
1  dan  hi       hello    1
1  dan  you      friend   2
3  jon  yeah     nope     3
2  jon  dog      cat      .5
3  jon  yes      no       .1

I am trying to create a dataframe that groups by ID and NAME that concatenates COMMENT1 and COMMENT2 that also sums NUM.

This is what I'm looking for:

ID Name COMMENT1     COMMENT2        NUM
1  dan  hi you       hello friend    3
3  jon  yeah yes     nope no         3.1
2  jon  dog          cat             .5

I tried using this:

input_df = input_df.groupby(['ID', 'NAME', 'COMMENT1', 'COMMENT2']).sum().reset_index()

But it doesn't work.

If I use this:

input_df = input_df.groupby(['ID']).sum().reset_index()

It sums the NUM column but leaves out all other columns.

kevintrankt
  • 139
  • 1
  • 3
  • 10
  • Possible duplicate of [Pandas groupby: How to get a union of strings](https://stackoverflow.com/questions/17841149/pandas-groupby-how-to-get-a-union-of-strings) - the accepted answer there shows how to use a lambda to get what you want – Patrick Artner Dec 01 '17 at 20:15

3 Answers3

18

Let us make it into one line

df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))
Out[1510]: 
   ID Name  COMMENT1      COMMENT2  NUM
0   1  dan    hi you  hello friend  3.0
1   2  jon       dog           cat  0.5
2   3  jon  yeah yes       nope no  3.1
BENY
  • 317,841
  • 20
  • 164
  • 234
4

You can also just tell .agg() which aggregator functions to use for each column, and for the string columns, pass ' '.join (notice there're no parenthesis since you don't want to call .join but rather pass it as the argument itself):

df.groupby(['ID','Name'],as_index=False).agg({'COMMENT1': ' '.join, 'COMMENT2': ' '.join, 'NUM': 'sum'})

hamx0r
  • 4,081
  • 1
  • 33
  • 46
-1

Converting your data example into a csv file, we can do the following:

import pandas as pd

def grouping_Cols_by_Cols(DF, grouping_Columns, num_Columns):
    # numerical columns can mess us up ...
    column_Names = DF.columns.tolist()
    # so, convert all columns' values to strings
    for column_Name in column_Names:
        DF[column_Name] = DF[column_Name].map(str) + ' '
    DF = DF.groupby(by=grouping_Columns).sum()

    # NOW, convert the numerical string columns to an expression ...
    for num_Col in num_Columns:
        column_Names = DF.columns.tolist()
        num_Col_i = column_Names.index(num_Col)
        for i in range(len(DF)):
            String = DF[num_Col].iloc[i] 
            value = eval(String.rstrip(' ').replace(' ','+'))
            DF.iat[i,num_Col_i] = value

    return DF

###############################################################
### Operations Section
###############################################################

df = pd.read_csv("UnCombinedData.csv")

grouping_Columns = ['ID','Name']
num_Columns = ['NUM']
df = grouping_Cols_by_Cols(df,grouping_Columns, num_Columns)

print df

With a little more work, the defined function could auto detect, which columns have numbers in them and add them to a numerical columns list.

I think this is similar, but not exact, to problems and challenges encountered in this post.

Thom Ives
  • 3,642
  • 3
  • 30
  • 29