Match word if not followed or preceded by < or >

Question

I am trying to not match words that are followed or preceded by an XML tag.

import re

strTest = "<random xml>hello this was successful price<random xml>"

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          print c1

Result is:

xml
this
was
successful
xml

Wanted Result:

this
was
successful

I have been trying negative lookahead and negative lookbehind assertions. I'm not sure if this is the right approach, I would appreciate any help.

You don't use regex to parse XML. Ever. Use an XML parser. Python has one [built in](https://docs.python.org/3/library/xml.etree.elementtree.html). Or install [lxml](http://lxml.de/). — Tomalak, Jul 26 '17 at 15:13
**[Don't use Regexp to parse XML](https://stackoverflow.com/a/1732454/1954610)**. Use an XML parser. — Tom Lord, Jul 26 '17 at 15:15
[A trick](http://www.rexegg.com/regex-best-trick.html#thetrick) can be: Match what you don't want, but [capture](http://www.regular-expressions.info/brackets.html) what you need. [`\w*\s*<[^>]*>\s*\w*|(\w+)`](https://regex101.com/r/bpaYAY/1) — bobble bubble, Jul 26 '17 at 15:45
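The trick in the last comment can be tried directly in Python. This is only a sketch of that comment's pattern: the helper name `words_away_from_tags` is made up for illustration, and the approach still treats the markup as plain text rather than parsing it.

```python
import re

# Pattern from the comment above: the left alternative consumes each tag
# plus any word next to it (optionally separated by whitespace); the right
# alternative captures whatever words are left over.
PATTERN = re.compile(r'\w*\s*<[^>]*>\s*\w*|(\w+)')

def words_away_from_tags(text):
    # Hypothetical helper: group(1) is None whenever the tag alternative matched.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

strTest = "<random xml>hello this was successful price<random xml>"
print(words_away_from_tags(strTest))  # ['this', 'was', 'successful']
```

Because of the `\s*` parts, a word separated from a tag only by whitespace is also discarded; drop them from the pattern if only directly touching words should be excluded.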

Bill Bell · Accepted Answer · 2017-07-26T16:06:04.787

First, to answer your question directly:

I do it by examining each 'word' consisting of a sequence of characters containing (mainly) alphabetics or '<' or '>'. When the regex offers them to some_only I look for one of the latter two characters. If neither appears I print the 'word'.

>>> import re
>>> strTest = "<random xml>hello this was successful price<random xml>"
>>> def some_only(matchobj):
...     # print only the tokens that contain neither '<' nor '>'
...     if '<' not in matchobj.group() and '>' not in matchobj.group():
...         print(matchobj.group())
... 
>>> ignore = re.sub(r'[<>\w]+', some_only, strTest)
this
was
successful

This works for your test string; however, as others have already mentioned, using a regex on XML will usually lead to many woes.

To use a more conventional approach I had to tidy away a couple of errors in that XML string, namely to change random xml to random_xml and to use a proper closing tag.

I prefer to use the lxml library.

>>> strTest = "<random_xml>hello this was successful price</random_xml>"
>>> from lxml import etree
>>> tree = etree.fromstring(strTest)
>>> tree.text
'hello this was successful price'
>>> tree.text.split(' ')
['hello', 'this', 'was', 'successful', 'price']
>>> tree.text.split(' ')[1:-1]
['this', 'was', 'successful']

I really like this solution, but I only want to use stdlib. How could this be done using xml.etree.ElementTree. BTW I am running Python 2.7. — Bman425, Jul 26 '17 at 15:55
@Bman425, it's basically identical. `import xml.etree.ElementTree as ET; tree = ET.fromstring(strTest); print tree.text.split(' ')[1:-1]` — Charles Duffy, Jul 26 '17 at 16:00
BTW, there's probably some work that could be done here to improve this answer's applicability -- descending the tree looking for elements and incorporating `.tail` as well as `.text`, for example; the OP's sample input is clearly inadequate to their actual intent. — Charles Duffy, Jul 26 '17 at 16:05
Agreed. My concern would be that this might easily go beyond the OP's skill level. As it is, simple question, simple answer. — Bill Bell, Jul 26 '17 at 16:08
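For readers who do want to go a step further, the idea of covering both `.text` and `.tail` with only the standard library might look roughly like this. It is a sketch, not part of the original answer: the function name `words_not_touching_tags` is invented here, and it assumes well-formed input and that "touching a tag" means no whitespace between the word and the tag. `Element.itertext()` yields every `.text` and `.tail` piece inside the root in document order, so no manual descent is needed.

```python
import xml.etree.ElementTree as ET

def words_not_touching_tags(xml_string):
    # Hypothetical helper: returns the words that are not directly
    # adjacent to a tag, across all nested elements.
    root = ET.fromstring(xml_string)
    found = []
    # Each chunk is one .text or .tail string, so it has a tag on both sides.
    for chunk in root.itertext():
        words = chunk.split()
        if not words:
            continue
        if not chunk[0].isspace():
            words = words[1:]    # first word touches the preceding tag
        if words and not chunk[-1].isspace():
            words = words[:-1]   # last word touches the following tag
        found.extend(words)
    return found

print(words_not_touching_tags(
    "<random_xml>hello this was successful price</random_xml>"))
# ['this', 'was', 'successful']
print(words_not_touching_tags("<a>one two <b>three four</b> five six</a>"))
# ['two', 'five']
```

The same code should also run on Python 2.7, where `itertext()` is available as well.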

score 0 · Answer 2 · answered Jul 26 '17 at 15:52

I'll give it a shot. Since we are already doing more than just a regex, put it into a list and drop the first/last items:

import re

strTest = "<random xml>hello this was successful price<random xml>"

thelist = []

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          thelist.append(c1)

thelist = thelist[1:-1]

print (thelist)

result:

['this', 'was', 'successful']

I would personally try to parse the XML instead, but since you have this code already up this slight modification could do the trick.
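As a possible alternative to slicing the list: the `(?<!=[<>])` part of the pattern is a negative lookbehind for an `=` followed by a bracket, so it never looks ahead at all. Assuming a negative lookahead `(?![<>])` was intended, the pattern alone can give the wanted result for this test string. This is a sketch for the sample input only, not a general way to handle XML.

```python
import re

strTest = "<random xml>hello this was successful price<random xml>"

# (?<![<>]) : the word must not be preceded by '<' or '>'
# (?![<>])  : the word must not be followed by '<' or '>'
# The \b anchors keep the engine from matching only part of a word.
pattern = re.compile(r'(?<![<>])\b\w+\b(?![<>])')

print(pattern.findall(strTest))  # ['this', 'was', 'successful']
```

It still only inspects the characters next to each word, so a word in the middle of a tag is matched: for `<a b c>` it returns `['b']`.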

This works well for the example I put, but I am worried it will not scale well. I agree that I should try using an XML parser. — Bman425, Jul 26 '17 at 15:59

score 0 · Answer 3 · answered Jul 26 '17 at 16:00

A simple way to do it with a list, assuming that a word preceded or followed by an XML tag is not separated from that tag by a space:

test = "<random xml>hello this was successful price<random xml>"

test = test.split()

new_test = []
for val in test:
    if "<" not in val and ">" not in val:
        new_test.append(val)

print(new_test)

The result will be:

['this', 'was', 'successful']

score 0 · Answer 4 · edited Jun 20 '20 at 09:12

My solution:

I don't see the need to use regex at all, you could solve it in a one-line list comprehension:

words = [w for w in test.split() if "<" not in w and ">" not in w]

answered Jul 29 '17 at 15:31 · Joe Iddon
