0

I have a big text file downloaded from MS-DIAL metabolomics MSP spectral kit containing EI-MS, MS/MS

The file is opened as txt file of compounds that look like that:

NAME: C11H11NO5; PlaSMA ID-967
PRECURSORMZ: 238.0712
PRECURSORTYPE: [M+H]+
FORMULA: C11H11NO5
Ontology: Formula predicted
INCHIKEY:
SMILES:
RETENTIONTIME: 1.74
CCS: -1
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos
Num Peaks: 2
192.06602   53
238.0757    31

NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141
PRECURSORMZ: 656.19415
PRECURSORTYPE: [M+H]+
FORMULA: C29H35O17
Ontology: Anthocyanidin O-glycosides
INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O
SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1
RETENTIONTIME: 2.81
CCS: 241.3010517
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only
Num Peaks: 0

Every compound has data between NAME to the next NAME.

What I'm trying to do Is remove all the compounds whose value in Num Peaks: is zero (i.e Num Peaks: 0. if the 12 line of the compound is Num Peaks: 0 delete all the data of thins compound - 12 rows up, to delete).

In the compounds above, it is to delete rows between NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141 till Num Peaks: 0 Afterward, I need the data to be saved back to txt or msp format.

What I did is only import the data as a list:

with open('path\to\MSMS-Public-Pos-VS15.msp') as f:
    lines = f.readlines()

Then create a list with indices, where each compound start link:

indices = [i for i, s in enumerate(lines) if 'NAME' in s]

I think, now I need to append consecutive indices that difference is greater than 14 (meaning has peak num greater than zero) link

# to find the difference between consecutive indices.

v = np.diff(indices)

select those with a difference 14 and add an element zero at the first location


diff14 = np.where(v == 14)

diff14 = np.append([0],diff14[0])

now I want to select only those values that are not in diff14 in order to create a new list with compounds whose number of peaks greater than zero

Now I need some loop to select the correct indices but do not know how:

lines[indices[diff14[0]]: indices[diff14[1]]]

lines[indices[diff14[1]+1] : indices[diff14[2]]]

lines[indices[diff14[2]+1] : lines[indices[diff14[3]]]]

lines[indices[diff14[3]+1] : indices[diff14[4]]]

Any better ideas or hints are greatly appreciated

TaL
  • 173
  • 2
  • 15
  • What you need here is to parse your input into e.g. a list of lists, each element holding a single compound. I would suggest 3 steps: (1) Parse the data into a list of Compounds, (2) Iterate this list of Compounds, remove the ones you do not want, (3) output the list back to file. Depending on how large the file is, either do this with 1 loop over the data or 3 separate loops. – oystein Feb 07 '21 at 14:48
  • @oystein - "(1) Parse the data into a list of Compounds" - This is the greatest difficulty for me - How do I do that? – TaL Feb 07 '21 at 15:03
  • I posted my suggestion as an answer https://stackoverflow.com/a/66090874/1942837 – oystein Feb 07 '21 at 17:24

3 Answers3

1

Here is a fairly straightforward way to process the file.

Open up the data file and iterate over its lines, storing them in a list (cache). If a line starts with NAME: then this line is the beginning of a new record and the cache can be printed if it is not empty.

If the line starts with Num Peaks: then the value is checked. If it's zero then the cache is emptied causing this record to be forgotten.

Lines consisting of whitespace only are skipped.

with open('data') as f:
    line_cache = []
    for line in f:
        if line.startswith('NAME:'):
            if line_cache:
                print(*line_cache, sep='')
                line_cache = []
        elif line.startswith('Num Peaks:'):
            num_peaks = int(line.partition(': ')[2])
            if num_peaks == 0:
                line_cache = []
                continue

        if line.strip():        # filter empty lines
            line_cache.append(line)

    if line_cache:    # don't forget the last record
        print(*line_cache, sep='', end='')

The output goes to stdout. It can be redirected onto a file in your shell environment. If you want to write directly to a file you can open it at the start and modify the print() statements:

with open('output', 'w') as output, open('data') as f:
    ...

And change the print()s to

print(*line_cache, sep='', file=output)
mhawke
  • 84,695
  • 9
  • 117
  • 138
  • Thank you @mhawke! I have 2 unclear questions with your permission: What is the meaning of the if statement - if line_cache: ? ( if line_cache is an empty list ?). my second question is about the: with open('output'), 'w') as output, open('data') as f: it show me somthing is wrong and Im not sure what it is. Many thanks for your help – TaL Feb 07 '21 at 15:24
  • @TaL: yes, `if line_cache:` tests for an empty list. So it's saying "if there are lines in the line cache". The other thing is just a typo - remove the extra paren. I've updated the answer to fix that. – mhawke Feb 07 '21 at 22:29
1
# Open / read tmp file created with the text you supplied
filedat = open('tmpWrt.txt','r')
filelines = filedat.readlines()

# Open output file object
file_out = open('tmp_out.txt','w')

line_count = 0

# Iterate through all file lines
for line in filelines:
    # If line is beginning of section
    # reset tmp variables
    if line != "\n" and line.split()[0] == "NAME:":
        tmp_lines = []
        flag = 'n'

    tmp_lines.append(line)
    line_count += 1

    # If line is the end of a section and peaks > 0
    # write to file
    if (line == "\n" or line_count == len(filelines)) and flag == 'y':
        #tmp_lines.append("\n")
        for tmp_line in tmp_lines:
            file_out.write(tmp_line)

    # If peaks > 0 set flag to "y"
    if line != "\n" and line.split()[0] == "Num":
            if int(line.split()[2]) != 0:
                flag = "y"

file_out.close()
Jayj
  • 31
  • 2
  • Thank You Jayj! appreciate your time! I'd like to ask if I want there will be an empty line that separating between each compound (section), how can do I do that? many thank !! – TaL Feb 07 '21 at 15:57
  • Good point TaL. I edited the code to add a line at the end of each section. Hopefully that is what you were looking for. – Jayj Feb 07 '21 at 16:22
  • Thank You again Jayj! I tried to run the newly edited code but there are still no empty lines between sections. Many thanks! – TaL Feb 07 '21 at 16:51
  • TaL, I moved some of the lines around and made the code a little more robust. The output file should have a blank line between each section. – Jayj Feb 07 '21 at 17:37
  • Jayj - The code works great! many many thanks for your time and help! – TaL Feb 08 '21 at 08:16
  • Nice! Glad to hear it. – Jayj Feb 08 '21 at 16:07
1

This is not as compact and memory efficient as the other answers, but the hope is that it should be easier to understand and extend.

My suggested approach is to parse your input into e.g. a list of lists, each element holding a single compound. I would suggest 3 steps: (1) Parse the data into a list of Compounds, (2) Iterate this list of Compounds, remove the ones you do not want, (3) output the list back to file. Depending on how large the file is, either do this with 1 loop over the data or 3 separate loops.

# Step (1) Parse the file
compounds = list() # store all compunds
with open('compound.txt', 'r') as f:
    # stores a single compound as a list of rows for a given compound.
    # Note: can be improved to e.g. a dictionary or a custom class
    current_compound = list()
    for line in f:
        if line.strip() == '': # assumes each compound is split by empty line(s)
            print('Empty line')
            # Store previous compound
            if len(current_compound) != 0:
                compounds.append(list(current_compound))

            # prepare for next compound
            current_compound = list()
        else:
            # At this point we could parse this more,
            # e.g. seperate into key/value, but lets just append the whole line with trailing newline
            print('Adding', line.strip())
            current_compound.append(line)

Okay, now lets check our progress

for item in compounds:
    print('\n===Compound===\n', item)

results in

===Compound===
 ['NAME: C11H11NO5; PlaSMA ID-967\n', 'PRECURSORMZ: 238.0712\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C11H11NO5\n', 'Ontology: Formula predicted\n', 'INCHIKEY:\n', 'SMILES:\n'\
, 'RETENTIONTIME: 1.74\n', 'CCS: -1\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE\
_Ripe_Pos\n', 'Num Peaks: 2\n', '192.06602   53\n', '238.0757    31\n']

===Compound===
 ['NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141\n', 'PRECURSORMZ: 656.19415\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C29H35O17\n', 'Ontology: Anthocyanidin O-glycosides\n\
', 'INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O\n', 'SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1\n', 'RETENTIONTIME: 2.81\n', '\
CCS: 241.3010517\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard\
 only\n', 'Num Peaks: 0\n']

You can then iterate through this compounds list and remove the ones with Num Peaks set to 0 before writing back to file. Let me know if you need help with this part as well.

oystein
  • 1,507
  • 1
  • 11
  • 13