Combine duplicate rows on specific column

Question

I am trying to combine rows of a dataframe in the event that there is a duplicate in one column. The dataframe looks like the following.

Name   Code   X   Y
 A     123   10   11
 B     456   12   13
 C     123   15   16

I want to combine on Code. So if the Code is the same, combine the other data separated by a comma. The resulting df would look like this:

Name   Code    X       Y
A,C    123   10,15   11,16
 B     456    12       13

My approach was the following:

    df = df.groupby(['Name','Code','Y'])['X'].astype(str).apply(', '.join).reset_index() 

    df = df.groupby(['Name','Code','X'])['Y'].astype(str).apply(', '.join).reset_index()

I get the following error :

"Cannot access callable attribute 'astype' of 'SeriesGroupBy' objects, try using the 'apply' method"

I have been unable to figure out how to use apply to cast as type str, any tips?

jezrael · Answer 1 · 2019-03-12T14:28:04.973

5

Create index from Code column fo avoid casting to strings, then cast all columns and aggregate by index function join:

df = df.set_index('Code').astype(str).groupby(level=0).agg(', '.join).reset_index()
#pandas 0.24+
#df = df.set_index('Code').astype(str).groupby('Code').agg(', '.join).reset_index()
print (df)
   Code  Name       X       Y
0   123  A, C  10, 15  11, 16
1   456     B      12      13

edited Mar 12 '19 at 14:28

answered Mar 12 '19 at 14:23

jezrael

822,522
95
1,334
1,252

score 3 · Accepted Answer · answered Mar 12 '19 at 14:26

3

Another layout

(df.set_index('Code').astype(str)+',').sum(level=0).apply(lambda x : x.str.strip(','))
Out[50]: 
     Name      X      Y
Code                   
123   A,C  10,15  11,16
456     B     12     13

answered Mar 12 '19 at 14:26

BENY

317,841
20
164
234

Why would this cause a number like '0.007' to become '0.00699999999' after running it? For some reason every solution offered changes the numbers after combining them. – MaxB Mar 13 '19 at 14:01
1

@MaxB that is the problem with pandas , if that is the case , format you float number first :-) – BENY Mar 13 '19 at 14:02
is there something I should do before calling this? Or attempt to round the numbers after? Rounding doesn't seem like it will solve my problem completely. – MaxB Mar 13 '19 at 14:03
1

@MaxB I mean something like df['float col']=df['float col'].map('${:,.3f}'.format) – BENY Mar 13 '19 at 14:05
Hmm. I then get the error 'Unknown format code 'f' for object of type 'str'' – MaxB Mar 13 '19 at 14:08
1

@MaxB let me show your the link . :-) https://stackoverflow.com/questions/20937538/how-to-display-pandas-dataframe-of-floats-using-a-format-string-for-columns – BENY Mar 13 '19 at 14:10

Sokolokki · Answer 3 · 2019-03-12T14:53:37.760

1

More general solution:

import pandas as pd
df = pd.DataFrame([['A',123,10,11],['B',456,12,13],['C',123,15,16]],columns=['Name','Code','X','Y'])

def f(x):
    return ','.join(x)

df = df.astype(str).groupby('Code').agg(f)
df.index = [int(i) for i in df.index.tolist()]

The last line changes the index from string back to integer type. I added this solution because it is easy to understand, however not the most elegant.

edited Mar 12 '19 at 14:53

answered Mar 12 '19 at 14:29

Sokolokki

833
1
9
19

This is my original solution, problem is column `Code` is casted to strings – jezrael Mar 12 '19 at 14:31
df.index = [int(i) for i in df.index.tolist()] – Sokolokki Mar 12 '19 at 14:36

Combine duplicate rows on specific column

3 Answers3