This question follows my previous question output multiple files based on column value python pandas but this time i want to go a bit further.
so this time i have a small sample data set:
import pandas as pd
df = {'ID': ['H900','H901','H902','M1436','M1435','M149','M157','M213','M699','M920','M871','M789','M617','M991','H903','M730','M191'],
'CloneID': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
'Length': [48,42 ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}
df = pd.DataFrame(df)
it looks like:
df
Out[6]:
CloneID ID Length
0 0 H900 48
1 1 H901 42
2 2 H902 48
3 2 M1436 48
4 2 M1435 48
5 2 M149 48
6 2 M157 48
7 2 M213 48
8 3 M699 48
9 3 M920 48
10 3 M871 48
11 4 M789 48
12 4 M617 48
13 4 M991 48
14 5 H903 48
15 5 M730 48
16 6 M191 48
I want to output each 'cloneID' to a different output file, but this time ONLY the ones that contains IDs that starts with "H".
so my desired output, 4 output files:
first file would be 'cloneID0.txt'
CloneID ID Length
0 H900 48
second file would be 'CloneID1.txt'
CloneID ID Length
1 H901 42
third file would be 'CloneID2.txt'
CloneID ID Length
2 H902 48
2 M1436 48
2 M1435 48
2 M149 48
2 M157 48
2 M213 48
second file would be 'CloneID5.txt'
CloneID ID Length
5 H903 48
5 M730 48
so there would be no 'CloneID3.txt', 'CloneID4.txt' and 'CloneID6.txt' because those clones do not have any ID that starts with"H".
my code:
import pandas as pd
data = pd.read_csv('data.txt', sep = '\t')
gp = data.groupby('CloneID')
for g in gp.groups:
for s in data.ID:
if s.startswith("H"):
path = 'IgHCloneID' + str(g) + '.xlsx'
gp.get_group(g).to_excel(path, index=False)
It still gave all clone files, not just the ones that contains the IDs that starts with "H".