
I have been trying to match tag names only (without the < and > signs) in cases of regular tags:

<w:tag w:attrib1="http://url" w:attrib2="anyValue">

without matching solo tags (opening-closing tags):

<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />

(Please note that the URLs in the attributes contain forward slashes (/).)

but I could not get it to work with:

regex = re.compile(r'(?<=<)w:\w+(?=[\w\W]+>)(?!\s/>)')

print(regex.findall(string))

getting this:

['w:tag','w:tag2']

expecting this:

['w:tag']

any thoughts?

Cheers.

devdc

2 Answers


1) Go easy on lookahead and lookbehind assertions; they're hard to control and you rarely need them. Use a capturing group to extract part of the matched string, and use negated character classes and non-greedy matching (if needed) to avoid matching too much:

re.findall(r'<\s*(w:\w+)[^>]*(?<!/)>', string)

Easier to read, isn't it? However,

2) Don't do this at all! Don't rely on regular expressions to match XML or HTML; you're just asking for heartbreak. See https://stackoverflow.com/a/1732454/699305 for the details. :-) Get familiar with Python's xml.etree.ElementTree and its XPath expressions instead. It'll take some getting used to, but it will be time well spent; you won't regret it.
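To see why the pattern from the question over-matches while the one in point 1 does not, both can be run on the two sample tags. This is only a minimal sketch: the names `sample`, `original` and `fixed` are illustrative and do not come from the question.

```python
import re

# The two sample tags from the question: one regular, one self-closing.
sample = (
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">\n'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
)

# Pattern from the question. (?!\s/>) is tested right after the tag name,
# where the attributes begin, so it never sees the trailing " />".
original = re.compile(r'(?<=<)w:\w+(?=[\w\W]+>)(?!\s/>)')

# Pattern from point 1. [^>]* cannot leave the current tag, and (?<!/)
# rejects a "/" immediately before the closing ">".
fixed = re.compile(r'<\s*(w:\w+)[^>]*(?<!/)>')

original_matches = original.findall(sample)
fixed_matches = fixed.findall(sample)
print(original_matches)  # ['w:tag', 'w:tag2']
print(fixed_matches)     # ['w:tag']
```

Because the second pattern has exactly one capturing group, `findall` returns the group contents (the tag names) rather than the whole match.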

alexis
  • I know XML and lxml well and love them, but this time I need to handle some broken markup. Thanks for your detailed answer. It works like a charm and indeed looks better than what I came up with. – devdc Oct 27 '12 at 22:34

Found it:

regex = re.compile(r'(?<=<)w:\w+(?=>)|(?<=<)w:\w+(?=[\s\w+:\w+="[\w/:.-]+"]{0,10}>)')
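A quick check of this pattern on the sample tags from the question, as a sketch only (the names `sample` and `found` are illustrative). Note that Python's `re` parses the bracketed part as a single character class that ends at the first `]`, not as a group of attribute patterns, so the result below shows how it behaves on this particular input and should not be read as a general guarantee.

```python
import re

# The two sample tags from the question: one regular, one self-closing.
sample = (
    '<w:tag w:attrib1="http://url" w:attrib2="anyValue">\n'
    '<w:tag2 w:attrib1="anyValue" w:attrib2="http://url" />'
)

# First alternative: a tag with no attributes, e.g. <w:p>.
# Second alternative: the lookahead only succeeds when a double quote is
# directly followed by ">", which a tag ending in '" />' never provides.
found = re.compile(
    r'(?<=<)w:\w+(?=>)|(?<=<)w:\w+(?=[\s\w+:\w+="[\w/:.-]+"]{0,10}>)'
)

found_matches = found.findall(sample)
print(found_matches)  # ['w:tag']
```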
devdc