1

I have a sample dataset:

import pandas as pd
df = {'ID': ['H1','H2','H3','H4','H5','H6'],
      'AA1': ['C','B','B','X','G','G'],
      'AA2': ['W','K','K','A','B','B'],
      'name':['n1','n2','n3','n4','n5','n6']
}

df = pd.DataFrame(df)

It looks like this:

df
Out[32]: 
   AA1 AA2  ID name
0   C   W  H1   n1
1   B   K  H2   n2
2   B   K  H3   n3
3   X   A  H4   n4
4   G   B  H5   n5
5   G   B  H6   n6

I want to group by AA1 and AA2 so that each unique AA1/AA2 pair appears only once. It doesn't matter which ID and name values come along with each unique pair. I then want to write the result to a .csv file, so the contents of the file would look like:

 AA1 AA2  ID name
  C   W  H1   n1
  B   K  H2   n2
  X   A  H4   n4
  G   B  H5   n5

I tried this code:

df.groupby('AA1','AA2').apply(to_csv('merged.txt', sep = '\t', index=False))

But to_csv was not recognized. What can I put in the .apply() to just write the groupby results to a CSV file?

Jessica
  • So you just want the first row of each unique `AA1`, `AA2` pair? – evan.oman Nov 30 '16 at 22:22
  • The behavior you indicated is not a groupby operation. Are you just keeping the first occurrence of a unique AA1-AA2 pair? Or do you have to aggregate within each pair somehow? – 3novak Dec 01 '16 at 02:15
  • Just keep the first occurrence of each unique AA1-AA2 pair. – Jessica Dec 01 '16 at 13:45

2 Answers

3

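The problem is that you are calling to_csv as if it were a standalone function, and no function with that name exists in your session. A groupby object doesn't have a to_csv method either; pd.Series and pd.DataFrame do. Note also that multiple grouping columns have to be passed to groupby as a list, i.e. groupby(['AA1','AA2']).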

What you should really use here is drop_duplicates, and then export the resulting DataFrame to CSV:

df.drop_duplicates(['AA1','AA2']).to_csv('merged.txt')
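
For reference, here is a self-contained sketch of the drop_duplicates approach applied to the sample data from the question. The sep='\t' and index=False arguments are taken from the asker's own attempt, and keep='first' is spelled out only to make the default visible; treat the exact options as assumptions about the desired output.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'ID': ['H1', 'H2', 'H3', 'H4', 'H5', 'H6'],
    'AA1': ['C', 'B', 'B', 'X', 'G', 'G'],
    'AA2': ['W', 'K', 'K', 'A', 'B', 'B'],
    'name': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6'],
})

# Keep the first row seen for each unique (AA1, AA2) pair.
# keep='first' is the default; it is written out here for clarity.
unique_pairs = df.drop_duplicates(subset=['AA1', 'AA2'], keep='first')

# Write a tab-separated file without the index column,
# mirroring the arguments used in the question.
unique_pairs.to_csv('merged.txt', sep='\t', index=False)
```

The rows for H3 and H6 are dropped because their AA1/AA2 pairs already appeared in H2 and H5.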

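PS: If you really want a groupby solution, here is one that happens to be 12 times slower than drop_duplicates. It takes the most frequent ID and name within each pair rather than the first, which is fine here since any value is acceptable: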

df.groupby(['AA1','AA2']).agg(lambda x:x.value_counts().index[0]).to_csv('merged.txt')
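
If the goal is specifically the first occurrence per pair while still going through groupby, one possible variant is GroupBy.first(). This is a sketch rather than the answer's own method: sort=False and as_index=False are choices made here to keep the pairs in order of first appearance and to keep AA1 and AA2 as ordinary columns. Be aware that first() picks the first non-null value per column, so with missing data it can combine values from different rows.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'ID': ['H1', 'H2', 'H3', 'H4', 'H5', 'H6'],
    'AA1': ['C', 'B', 'B', 'X', 'G', 'G'],
    'AA2': ['W', 'K', 'K', 'A', 'B', 'B'],
    'name': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6'],
})

# sort=False keeps groups in order of first appearance;
# as_index=False returns AA1 and AA2 as regular columns.
first_per_pair = (
    df.groupby(['AA1', 'AA2'], sort=False, as_index=False)
      .first()
)

# Tab-separated output without the index, as in the question.
first_per_pair.to_csv('merged.txt', sep='\t', index=False)
```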
Julien Marrec
2

You can use groupby with head:

df.groupby(['AA1', 'AA2']).head(1)
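
To get from there to the file the question asks for, the result of head(1) is an ordinary DataFrame, so to_csv can be called on it directly. A minimal sketch on the question's sample data follows; the sep and index arguments are carried over from the asker's attempt.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'ID': ['H1', 'H2', 'H3', 'H4', 'H5', 'H6'],
    'AA1': ['C', 'B', 'B', 'X', 'G', 'G'],
    'AA2': ['W', 'K', 'K', 'A', 'B', 'B'],
    'name': ['n1', 'n2', 'n3', 'n4', 'n5', 'n6'],
})

# head(1) returns the first row of every group as a plain DataFrame,
# keeping the original row order and index labels.
first_rows = df.groupby(['AA1', 'AA2']).head(1)

# Tab-separated output without the index, as in the question.
first_rows.to_csv('merged.txt', sep='\t', index=False)
```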


piRSquared