Python XML regular expression matching issue

Question

I have been trying to match tag names only (without the < and > signs) in the case of regular tags:

<w:tag w:attrib1="http://url" w:attrib2="anyValue">

without matching solo tags (opening-closing tags):

<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />

(Note that the URLs in the attributes contain forward slashes (/).)

but I could not get there with:

regex = re.compile(r'(?<=<)w:\w+(?=[\w\W]+>)(?!\s/>)')

print(regex.findall(string))

getting this:

['w:tag','w:tag2']

expecting this:

['w:tag']

Any thoughts?

Cheers.

score 1 · Answer 1 · edited May 23 '17 at 12:27

1) Go easy on the lookahead/lookbehind; they're hard to control and you rarely really need them. Use capturing groups to extract part of the matched string. Use negated character classes and non-greedy search (if needed) to avoid matching too much:

re.findall(r'<\s*(w:\w+)[^>]*(?<!/)>', string)
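A minimal sketch of that pattern run against the two tags from the question. The `sample` string is an assumption: it is just the question's two example tags joined together.

```python
import re

# Sample input assembled from the two example tags in the question.
sample = (
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
)

# (w:\w+)  captures the tag name only.
# [^>]*    consumes the attributes without running past the closing ">".
# (?<!/)   rejects a ">" that is immediately preceded by "/" (a solo tag).
pattern = re.compile(r'<\s*(w:\w+)[^>]*(?<!/)>')

names = pattern.findall(sample)
print(names)  # ['w:tag']
```

Because `[^>]*` stops at the first `>`, an attribute value that itself contains `>` would cut the match short.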

Easier to read, isn't it? However,

2) Don't do this at all! Don't rely on REs to match XML or HTML; you're just asking for heartbreak. See https://stackoverflow.com/a/1732454/699305 for the details. :-) Get familiar with using Python's xml.etree.ElementTree with XPath expressions instead. It'll take some getting used to, but it will be time well spent; you won't regret it.
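A rough sketch of the ElementTree route, under some assumptions: the document is well-formed, the `w:` prefix is bound to a namespace (the URI `http://example.com/w` and the `w:root` wrapper are made up for illustration), and "regular tag" is approximated as "element with text or children", since a parser does not distinguish `<a/>` from `<a></a>`.

```python
import xml.etree.ElementTree as ET

W_URI = "http://example.com/w"   # hypothetical namespace URI
NS = {"w": W_URI}

# Hypothetical well-formed document built around the question's two tags.
doc = (
    '<w:root xmlns:w="http://example.com/w">'
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">text</w:tag>'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
    '</w:root>'
)

root = ET.fromstring(doc)

# XPath-style lookup by prefixed name, resolved through the NS mapping.
found = root.findall("w:tag", NS)

# ElementTree stores names as "{uri}local"; strip the "{uri}" part.
def local_name(element):
    return element.tag.split("}", 1)[1]

# Direct children that carry text or child elements.
non_empty = [
    local_name(el)
    for el in root
    if len(el) or (el.text and el.text.strip())
]

print(non_empty)                               # ['tag']
print(found[0].get("{%s}attrib1" % W_URI))     # http://url
```

This only helps when the input parses; for broken markup a strict parser raises `ParseError`.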

I know XML and lxml well and love them, but this time I need to handle some broken stuff... Thanks for your detailed answer. It works like a charm and indeed looks better than what I came up with. - devdc, Oct 27 '12 at 22:34

score 0 · Accepted Answer · edited Oct 27 '12 at 18:21

Found it:

regex = re.compile(r'(?<=<)w:\w+(?=(?:\s+\w+:\w+="[\w/:.-]+"){0,10}>)')
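A small check of a lookaround-only pattern of this kind. The sketch assumes the attribute list is meant as a repeated group: `(?:...)` repeated up to ten times, then `>`. That grouping is a restructuring for illustration and not necessarily the exact pattern posted. The `sample` string is again the question's two tags joined together.

```python
import re

sample = (
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
)

# One attribute: whitespace, prefix:name, ="value" with URL-safe characters.
attr = r'\s+\w+:\w+="[\w/:.-]+"'

# (?<=<)            the name must directly follow "<" (not included in match)
# (?=(?:attr){0,10}>)  zero to ten attributes, then ">" with no "/" before it
pattern = re.compile(r'(?<=<)w:\w+(?=(?:' + attr + r'){0,10}>)')

matches = pattern.findall(sample)
print(matches)  # ['w:tag']
```

The solo tag is skipped because its attributes are followed by ` />`, so the lookahead never reaches a bare `>`.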

answered Oct 27 '12 at 18:00 by devdc · edited by Emil
