1

This question follows my previous question output multiple files based on column value python pandas but this time i want to go a bit further.

so this time i have a small sample data set:

import pandas as pd

df = {'ID': ['H900','H901','H902','M1436','M1435','M149','M157','M213','M699','M920','M871','M789','M617','M991','H903','M730','M191'],
  'CloneID': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
  'Length': [48,42  ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}

df = pd.DataFrame(df)

it looks like:

df
Out[6]: 
    CloneID   ID  Length
0       0   H900      48
1       1   H901      42
2       2   H902      48
3       2   M1436     48
4       2   M1435     48
5       2   M149      48
6       2   M157      48
7       2   M213      48
8       3   M699      48
9       3   M920      48
10      3   M871      48
11      4   M789      48
12      4   M617      48
13      4   M991      48
14      5   H903      48
15      5   M730      48
16      6   M191      48

I want to output each 'cloneID' to a different output file, but this time ONLY the ones that contains IDs that starts with "H".

so my desired output, 4 output files:

first file would be 'cloneID0.txt'

    CloneID   ID  Length
      0      H900      48

second file would be 'CloneID1.txt'

    CloneID   ID  Length
      1      H901      42

third file would be 'CloneID2.txt'

    CloneID   ID  Length
       2     H902      48
       2     M1436     48
       2     M1435     48
       2     M149      48
       2     M157      48
       2     M213      48

second file would be 'CloneID5.txt'

    CloneID   ID  Length
      5     H903      48
      5     M730      48

so there would be no 'CloneID3.txt', 'CloneID4.txt' and 'CloneID6.txt' because those clones do not have any ID that starts with"H".

my code:

import pandas as pd
data = pd.read_csv('data.txt', sep = '\t')
gp = data.groupby('CloneID')
for g in gp.groups:
    for s in data.ID:
        if s.startswith("H"):
           path = 'IgHCloneID' + str(g) + '.xlsx'
           gp.get_group(g).to_excel(path, index=False)

It still gave all clone files, not just the ones that contains the IDs that starts with "H".

Community
  • 1
  • 1
Jessica
  • 2,923
  • 8
  • 25
  • 46

2 Answers2

3

You can first filter by condition in any value in column ID startswith 'H' and last groupby with to_csv:

df1 = (df.groupby('CloneID').filter(lambda x: (x.ID.str.startswith("H").any())))

df1.groupby('CloneID').apply(lambda x: x.to_csv('CloneID{}.txt'.format(x.name), index=False))
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

You can groupby CloneID and write to csv directly in the apply method:

df.groupby('CloneID').apply(lambda gp: gp.to_csv('CloneID{}.txt'.format(gp.name)))

This will retain the original index, but it can be fixed by .set_index('CloneID') before to_csv call.

Edit: To retain only the groups where corresponding ID starts with H:

This requires a check for each group; here is one approach:

df.groupby('CloneID').apply(
    lambda gp: gp.to_csv('CloneID{}.txt'.format(gp.name))
    if any(gp.ID.str.startswith('H'))
    else None)
hilberts_drinking_problem
  • 11,322
  • 3
  • 22
  • 51