3

I am trying to combine rows of a dataframe in the event that there is a duplicate in one column. The dataframe looks like the following.

Name   Code   X   Y
 A     123   10   11
 B     456   12   13
 C     123   15   16

I want to combine on Code. So if the Code is the same, combine the other data separated by a comma. The resulting df would look like this:

Name   Code    X       Y
A,C    123   10,15   11,16
 B     456    12       13

My approach was the following:

    df = df.groupby(['Name','Code','Y'])['X'].astype(str).apply(', '.join).reset_index() 

    df = df.groupby(['Name','Code','X'])['Y'].astype(str).apply(', '.join).reset_index() 

I get the following error :

"Cannot access callable attribute 'astype' of 'SeriesGroupBy' objects, try using the 'apply' method"

I have been unable to figure out how to use apply to cast as type str, any tips?

MaxB
  • 428
  • 1
  • 8
  • 24

3 Answers3

5

Create index from Code column fo avoid casting to strings, then cast all columns and aggregate by index function join:

df = df.set_index('Code').astype(str).groupby(level=0).agg(', '.join).reset_index()
#pandas 0.24+
#df = df.set_index('Code').astype(str).groupby('Code').agg(', '.join).reset_index()
print (df)
   Code  Name       X       Y
0   123  A, C  10, 15  11, 16
1   456     B      12      13
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
3

Another layout

(df.set_index('Code').astype(str)+',').sum(level=0).apply(lambda x : x.str.strip(','))
Out[50]: 
     Name      X      Y
Code                   
123   A,C  10,15  11,16
456     B     12     13
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Why would this cause a number like '0.007' to become '0.00699999999' after running it? For some reason every solution offered changes the numbers after combining them. – MaxB Mar 13 '19 at 14:01
  • 1
    @MaxB that is the problem with pandas , if that is the case , format you float number first :-) – BENY Mar 13 '19 at 14:02
  • is there something I should do before calling this? Or attempt to round the numbers after? Rounding doesn't seem like it will solve my problem completely. – MaxB Mar 13 '19 at 14:03
  • 1
    @MaxB I mean something like df['float col']=df['float col'].map('${:,.3f}'.format) – BENY Mar 13 '19 at 14:05
  • Hmm. I then get the error 'Unknown format code 'f' for object of type 'str'' – MaxB Mar 13 '19 at 14:08
  • 1
    @MaxB let me show your the link . :-) https://stackoverflow.com/questions/20937538/how-to-display-pandas-dataframe-of-floats-using-a-format-string-for-columns – BENY Mar 13 '19 at 14:10
1

More general solution:

import pandas as pd
df = pd.DataFrame([['A',123,10,11],['B',456,12,13],['C',123,15,16]],columns=['Name','Code','X','Y'])

def f(x):
    return ','.join(x)

df = df.astype(str).groupby('Code').agg(f)
df.index = [int(i) for i in df.index.tolist()] 

The last line changes the index from string back to integer type. I added this solution because it is easy to understand, however not the most elegant.

Sokolokki
  • 833
  • 1
  • 9
  • 19