1

I have a small sample dataset:

import pandas as pd


df = {'ID': ['H576','H577','H577','H578','H600', 'H700', 'H700'],
  'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC','DDDDDD', 'EEEEEEE','FFFFFFF','GGGGGGG']}

df = pd.DataFrame(df)

it looks like:

df
Out[9]: 
        CD    ID
0  AAAAAAA  H576
1    BBBBB  H577
2   CCCCCC  H577
3   DDDDDD  H578
4  EEEEEEE  H600
5  FFFFFFF  H700
6  GGGGGGG  H700

For each ID that has more than one CD values, i want to save them to a separate file.

my desire output files:

H577.txt

  CD      ID
 BBBBB   H577
 CCCCCC  H577

H700.txt

  CD       ID
 FFFFFFF  H700
 GGGGGGG  H700

my attempt:

import pandas as pd


df = {'ID': ['H576','H577','H577','H578','H600', 'H700', 'H700'],
  'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC','DDDDDD', 'EEEEEEE','FFFFFFF','GGGGGGG']}

df = pd.DataFrame(df)

df1 = (df.groupby('ID').filter(lambda x: ('if CD has more than one value for the same ID'.any())))

df1.groupby('ID').apply(lambda gp: gp.to_csv('ID{}.txt'.format(gp.name), sep='\t', index=False))

i am not sure how to code the 'if CD has more than one value of the same ID' part.

Jessica
  • 2,923
  • 8
  • 25
  • 46
  • This problem reduces to finding duplicates in a list for which a simple google search returns this: 'http://stackoverflow.com/questions/9835762/find-and-list-duplicates-in-python-list'. You have to apply that in the IDs list and from that point on, it's easy sailing. – Ma0 Jun 21 '16 at 14:58
  • @Ev. Kounis the answer from the post is for removing duplicates, in my case i don't want that, i'm still new at python and i am not sure how to fit the code in the lambda style either. – Jessica Jun 21 '16 at 15:01
  • It does not but i can fetch it for you. What do you want df1 to be? A construction like df containing the duplicates only ? – Ma0 Jun 21 '16 at 15:05
  • so i would need to write a function to get the duplicate sets, what should it return? and once i get the function how should fit it to the lambda code? – Jessica Jun 21 '16 at 15:07
  • why does it have to be a lambda code? – Ma0 Jun 21 '16 at 15:10

1 Answers1

1

try this:

(df.groupby('ID')
   .filter(lambda g: len(g) > 1)
   .groupby('ID')
   .apply(lambda gp: gp.to_csv('ID{}.txt'.format(gp.name), sep='\t', index=False))
)

Output

IDH577.txt:

CD  ID
BBBBB   H577
CCCCCC  H577

IDH700.txt:

CD  ID
FFFFFFF H700
GGGGGGG H700
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419