Python regex too greedy, misses first occurrence in XML

Question

I have the following Python regex:

xml_parse = re.search(r'^.+?<Hit_accession>(\w+?)</Hit_accession>.+?<Hsp_qseq>(\w+?)</Hsp_qseq>\s+?<Hsp_hseq>(\w+?)</Hsp_hseq>\s+?<Hsp_midline>(.+?)</Hsp_midline>',string,flags=re.DOTALL)

for the following text:

<?xml version="1.0"?>
 <Hit_accession> Desired Group #1 </Hit_accession>
<Hsp>
 <Hsp_qseq> Desired Group # 2 </Hsp_qseq>
 <Hsp_hseq> Desired Group # 3 </Hsp_hseq>
 <Hsp_midline> Desired Group # 4 </Hsp_midline>
</Hsp>

... way later in the XML string

 <Hit_accession> Undesired Group #1 </Hit_accession>
<Hsp>
 <Hsp_qseq> Undesired Group # 2 </Hsp_qseq>
 <Hsp_hseq> Undesired Group # 3 </Hsp_hseq>
 <Hsp_midline> Undesired Group # 4 </Hsp_midline>
</Hsp>

The groups that are being returned are:
(1) Desired Group #1
(2) Undesired Group #2
(3) Undesired Group #3
(4) Undesired Group #4

Why is this happening? Since I'm getting Desired Group #1 and using non-greedy .+? with flags=re.DOTALL, I would expect that it would not skip over any of my Desired Groups 2-4.

Thanks in advance.

UPDATE:

Ended up using xml.etree.ElementTree as follows:

import xml.etree.ElementTree

# `string` holds the full BLAST XML output
tree = xml.etree.ElementTree.fromstring(string)
iteration = tree.find("BlastOutput_iterations/Iteration")
hits = iteration.findall("Iteration_hits/Hit")
topHit = hits[0]  # the first <Hit> is the top hit
accessionNCBI = topHit.findtext("Hit_accession")

Found the following link useful for NCBI BLAST specific XML parsing examples: http://www.dalkescientific.com/writings/NBN/elementtree.html
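
For anyone who wants to try the ElementTree approach without a real BLAST file, here is a minimal self-contained sketch. The element paths down to `Hit_accession` come from the snippet above; the `Hit_hsps/Hsp` nesting, the sample values, and the `top_hit` helper name are assumptions for illustration, so check them against your own output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed BLAST-style document (real output has many more fields).
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_accession>ABC123</Hit_accession>
          <Hit_hsps>
            <Hsp>
              <Hsp_qseq>MKT-LLV</Hsp_qseq>
              <Hsp_hseq>MKTALLV</Hsp_hseq>
              <Hsp_midline>MKT LLV</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_accession>XYZ789</Hit_accession>
          <Hit_hsps>
            <Hsp>
              <Hsp_qseq>MKTLLV</Hsp_qseq>
              <Hsp_hseq>MKTLLV</Hsp_hseq>
              <Hsp_midline>MKTLLV</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def top_hit(xml_text):
    """Return the fields of the first <Hit>, or None if there are no hits."""
    root = ET.fromstring(xml_text)
    # find() returns the first match in document order, i.e. the top hit.
    hit = root.find("BlastOutput_iterations/Iteration/Iteration_hits/Hit")
    if hit is None:
        return None
    # findtext() returns the element's text, or None when the path is missing.
    return {
        "accession": hit.findtext("Hit_accession"),
        "qseq": hit.findtext("Hit_hsps/Hsp/Hsp_qseq"),
        "hseq": hit.findtext("Hit_hsps/Hsp/Hsp_hseq"),
        "midline": hit.findtext("Hit_hsps/Hsp/Hsp_midline"),
    }


result = top_hit(SAMPLE)
print(result)
```

Because the parser follows the tree structure, the gap character and the space in the sample values are returned as-is, and the second hit is never consulted.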

**`xml_parse = re.search(...`** aaaaaaarrrrrggggggghhhhh!!!! -- Please follow this [link](http://stackoverflow.com/a/1732454/1132524). — Rik Poggi, Mar 04 '12 at 09:47
What's the difference between the desired and undesired groups other than your lack of affection for one? — Burhan Khalid, Mar 04 '12 at 09:51
The input is apparently supposed to be XML, but what is shown is not well-formed. — mzjn, Mar 04 '12 at 10:03
I only want the first occurrence (the "top hit"), so all the downstream information is irrelevant — ncemami, Mar 04 '12 at 12:34
Simple answer: don't use regular expressions to parse XML. That's what XML parsers are for. — Michael Kay, Mar 04 '12 at 13:15

score 5 · Accepted Answer · answered Mar 04 '12 at 09:49


Hmmm, XML and a Regex. Looks like fun.

How about you use a Python XML library, like the built-in ElementTree or the third-party libxml2 bindings?

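from xml.etree.ElementTree import ElementTree
doc = ElementTree(file='myfile.xml')

# Walk every <Hit> element; findtext() reads child element text
# (get() only reads attributes, and this XML has none).
for hit in doc.iter('Hit'):
    print(hit.findtext('Hit_accession'))
    print(hit.findtext('.//Hsp_qseq'))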

Seriously, you'll save yourself a lot of headaches. Regex is not meant for XML parsing.
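
As an illustration of the kind of headache meant here, the sketch below reproduces the symptom from the question on a tiny made-up input. It rests on an assumption, since the real data is not shown: that the first `Hsp_qseq` contains a character `\w` cannot match, such as the `-` BLAST uses for alignment gaps. A lazy `.+?` does not mean "stop at the first occurrence"; it only prefers the shortest span that still lets the rest of the pattern match, so it keeps expanding until a later group fits. The tag names `<a>` and `<b>` and the sample values are hypothetical.

```python
import re

# Hypothetical miniature input: the first <b> value contains '-', which \w cannot match.
TEXT = "<a>first</a>\n<b>AC-GT</b>\n...\n<a>second</a>\n<b>ACGT</b>"

# Same shape as the pattern in the question: a lazy '.+?' bridging two \w groups.
PATTERN = r"<a>(\w+?)</a>.+?<b>(\w+?)</b>"

match = re.search(PATTERN, TEXT, flags=re.DOTALL)
# Group 1 comes from the first record, group 2 from the second one:
# '.+?' had to stretch past "AC-GT" because (\w+?) cannot cross the '-'.
print(match.groups())  # ('first', 'ACGT')

# Allowing any non-'<' character in the second group keeps the match local.
FIXED = r"<a>(\w+?)</a>.+?<b>([^<]+?)</b>"
fixed = re.search(FIXED, TEXT, flags=re.DOTALL)
print(fixed.groups())  # ('first', 'AC-GT')
```

Widening the character class fixes this one case, but the pattern still knows nothing about nesting, which is why a parser is the safer tool.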

answered Mar 04 '12 at 09:49

Jordan


Btw, the built-in ElementTree is in `xml.etree.ElementTree`. – Lukáš Lalinský Mar 04 '12 at 09:51
`get()` is for getting attributes. There are no attributes in the input (which is not even well-formed). – mzjn Mar 04 '12 at 10:09
Thanks for the recommendation, I'm not sure if your code works exactly for the type of parsing I am doing but I used it as a guide for the code I have appended above. – ncemami Mar 04 '12 at 12:04
