I have a large text file downloaded from the MS-DIAL metabolomics MSP spectral kit, which contains EI-MS and MS/MS spectra.
Opened as a text file, it is a list of compounds that look like this:
NAME: C11H11NO5; PlaSMA ID-967
PRECURSORMZ: 238.0712
PRECURSORTYPE: [M+H]+
FORMULA: C11H11NO5
Ontology: Formula predicted
INCHIKEY:
SMILES:
RETENTIONTIME: 1.74
CCS: -1
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos
Num Peaks: 2
192.06602 53
238.0757 31
NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141
PRECURSORMZ: 656.19415
PRECURSORTYPE: [M+H]+
FORMULA: C29H35O17
Ontology: Anthocyanidin O-glycosides
INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O
SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1
RETENTIONTIME: 2.81
CCS: 241.3010517
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only
Num Peaks: 0
Every compound's data runs from one NAME line to the next NAME line.
What I'm trying to do is remove all the compounds whose Num Peaks: value is zero (i.e. Num Peaks: 0). In other words, if the Num Peaks line of a compound, 12 rows below its NAME line, is Num Peaks: 0, delete that line together with the 12 rows above it.
In the example above, that means deleting the rows from NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141 through Num Peaks: 0.
Afterward, I need the data to be saved back to txt or msp format.
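To make the goal concrete, here is a minimal sketch of the filter-and-save step described above. The function names (`split_records`, `num_peaks`, `filter_msp_file`) and both path parameters are hypothetical, and the sketch assumes that every record starts with a line beginning `NAME:`, that the value after `Num Peaks:` is always an integer, and that the file is UTF-8 encoded.

```python
def split_records(lines):
    """Group lines into records; a new record starts at each 'NAME:' line."""
    records = []
    for line in lines:
        if line.startswith("NAME:") or not records:
            records.append([])
        records[-1].append(line)
    return records


def num_peaks(record):
    """Return the integer after 'Num Peaks:', or None if the line is missing."""
    for line in record:
        if line.lower().startswith("num peaks:"):
            return int(line.split(":", 1)[1].strip())
    return None


def filter_msp_file(src_path, dst_path):
    """Copy src_path to dst_path, skipping records with Num Peaks: 0."""
    # Assumption: UTF-8 encoding; adjust if the file uses another one.
    with open(src_path, encoding="utf-8") as src:
        records = split_records(src.readlines())
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # Records without a 'Num Peaks:' line (None) are kept as they are.
            if num_peaks(record) != 0:
                dst.writelines(record)
```

Because each record is tested on its own Num Peaks line, this does not depend on every compound having the same number of header rows. Writing to a separate output file leaves the original download untouched.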
So far I have only read the file into a list of lines:
# Raw string so that the backslashes are not read as escapes such as \t
with open(r'path\to\MSMS-Public-Pos-VS15.msp') as f:
    lines = f.readlines()
Then I create a list of the indices where each compound starts:
indices = [i for i, s in enumerate(lines) if 'NAME' in s]
I think I now need to keep the consecutive indices whose difference is greater than 14 (meaning the compound has more than zero peaks):
import numpy as np
# Difference between consecutive NAME indices
v = np.diff(indices)
Select those with a difference of 14 and add a zero as the first element:
diff14 = np.where(v == 14)
diff14 = np.append([0],diff14[0])
Now I want to select only the values that are not in diff14, to build a new list with the compounds whose number of peaks is greater than zero.
I need some loop to select the correct indices, but I do not know how to write it:
lines[indices[diff14[0]]: indices[diff14[1]]]
lines[indices[diff14[1]+1] : indices[diff14[2]]]
lines[indices[diff14[2]+1] : indices[diff14[3]]]
lines[indices[diff14[3]+1] : indices[diff14[4]]]
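One possible shape for that loop, as a sketch rather than a definitive answer: pair each start index with the next one and slice `lines` between them. The function name `keep_nonempty_by_index` is hypothetical. Two assumptions are worth stating. First, `np.diff(indices)` yields no difference for the last compound, so this version appends `len(lines)` as a final boundary. Second, it checks for the `Num Peaks: 0` line itself instead of relying on a fixed difference of 14, which only holds when every record has the same number of header rows plus a blank separator line.

```python
def keep_nonempty_by_index(lines):
    """Return the lines of all records that do not contain 'Num Peaks: 0'."""
    # Start index of every record (a line beginning with 'NAME:').
    starts = [i for i, s in enumerate(lines) if s.startswith("NAME:")]
    # Add the end of the file so the last record also gets an end index.
    bounds = starts + [len(lines)]

    kept = []
    # Pair each start with the following start: (s0, s1), (s1, s2), ...
    for start, end in zip(bounds[:-1], bounds[1:]):
        record = lines[start:end]
        if not any(s.strip() == "Num Peaks: 0" for s in record):
            kept.extend(record)
    # Note: any lines before the first 'NAME:' line are not copied.
    return kept
```

The result is a plain list of lines, so it can be written back with `writelines` on a file opened in `'w'` mode.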
Any better ideas or hints are greatly appreciated.