Parsing blast output in .xml format

Question

I have a blast output in .xml format, but will not post an example here, since it is huge, unless you really require it. I go specifically to my question. The script below works OK. The only thing is I want to print the hit_def, which has different length in the file followed by a space. How to modify a code to print me hit_def? As you can see if I specify [:8] if will print me 8 characters, but then the length might be 10, 15 etc, how to improve this?

import re
import sys


base = sys.argv[1]
base = base.rstrip('xml')


if fasta_out == True:
    seq_out = open(base+'fasta', 'w')
read_def = set()
with open(sys.argv[1],'rb') as xml:
    for line in xml:
        if re.search('<Iteration_query-def>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = re.sub('<Iteration_query-def>', '', line)
            line = re.sub('</Iteration_query-def>', '', line)
            query_def = line
        if re.search('<Hit_def>', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = re.sub('<Hit_def>', '', line)
            line = re.sub('</Hit_def>', '', line)
            hit_def = line[:8]
            if fasta_out == True:
                print >> seq_out, query_def+'\t'+hit_def+'\n'
if fasta_out == True:
    seq_out.close()

This is an example how my hit_def looks,

>MLOC_36586.11 pep:known chromosome:030312v2:1:9883453:9888834:-1 gene:MLOC_36586 transcript:MLOC_36586.11 description:"Uncharacterized protein "
>MLOC_36586.2 pep:known chromosome:030312v2:1:9883444:9888847:-1 gene:MLOC_36586 transcript:MLOC_36586.2 description:"Uncharacterized protein "
>MLOC_51.2 pep:known chromosome:030312v2:1:322147737:322148802:1 gene:MLOC_51 transcript:MLOC_51.2 description:"Predicted protein\x3b Uncharacterized protein "
>MLOC_217.1 pep:known chromosome:030312v2:4:519918111:519919326:1 gene:MLOC_217 transcript:MLOC_217.1 description:"Uncharacterized protein "

Desired hit_def's,

MLOC_36586.11
MLOC_36586.2
MLOC_51.2
MLOC_217.1

Why not use an [`XML parser`](http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python)? — alecxe, May 26 '14 at 14:23
I have been working with this code from the beginning, so did not want to spend another time for another code, if there are ways to improve this one would be great! — user3224522, May 26 '14 at 14:25
I also find it really comfortable to extract translated sequences using regex... — user3224522, May 26 '14 at 14:26
If it's always the first element in the string, you could do something like `hit_def = line[:line.index(' ')]` — Mostly Harmless, May 26 '14 at 15:16
@user3224522 Happy to help. I've moved my comment into a more proper answer, feel free to accept it if you feel it adequately answers your question! — Mostly Harmless, May 26 '14 at 18:09

score 1 · Accepted Answer · answered May 26 '14 at 18:08

If you know it's always the first item in the string, you can do something like this:

hit_def = line[:line.index(' ')]

If it isn't necessarily first, you might go for a regex like this:

hit_def = re.findall(r'(MLOC_\d+\.\d+) ',line)[0]

I'm assuming that your hit_defs are all of the form MLOC_XXX.X, but you get the idea.

Parsing blast output in .xml format

1 Answers1