2

I am trying to not match words that are followed or preceded by an XML tag.

import re

strTest = "<random xml>hello this was successful price<random xml>"

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          print c1

Result is:

xml
this
was
successful
xml

Wanted Result:

this
was
successful

I have been trying negative lookahead and negative lookbehind assertions. I'm not sure if this is the right approach, I would appreciate any help.

Sumner Evans
Bman425
  • 2
    You don't use regex to parse XML. Ever. Use an XML parser. Python has one [built in](https://docs.python.org/3/library/xml.etree.elementtree.html). Or install [lxml](http://lxml.de/). – Tomalak Jul 26 '17 at 15:13
  • 1
    **[Don't use Regexp to parse XML](https://stackoverflow.com/a/1732454/1954610)**. Use an XML parser. – Tom Lord Jul 26 '17 at 15:15
  • 1
    [A trick](http://www.rexegg.com/regex-best-trick.html#thetrick) can be: Match what you don't want, but [capture](http://www.regular-expressions.info/brackets.html) what you need. [`\w*\s*<[^>]*>\s*\w*|(\w+)`](https://regex101.com/r/bpaYAY/1) – bobble bubble Jul 26 '17 at 15:45
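The trick from the last comment can be sketched in Python roughly as follows. The pattern is the one given in that comment; the helper name `words_away_from_tags` is hypothetical. The first alternative consumes a tag together with any word touching it (optionally across whitespace), and only the second alternative captures, so matches with an empty group 1 are simply skipped.

```python
import re

# Pattern taken from the comment above: the left alternative swallows a tag
# plus any adjacent word; the right alternative captures a free-standing word.
TAG_OR_WORD = re.compile(r'\w*\s*<[^>]*>\s*\w*|(\w+)')


def words_away_from_tags(text):
    """Return the words that are not directly next to a <...> tag.

    Hypothetical helper for illustration; it treats anything between
    '<' and '>' as a tag and does not validate the markup.
    """
    return [m.group(1) for m in TAG_OR_WORD.finditer(text) if m.group(1)]


str_test = "<random xml>hello this was successful price<random xml>"
print(words_away_from_tags(str_test))
```

This is still a regex heuristic, so it inherits the usual caveats about applying regular expressions to XML.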

4 Answers

2

First, to answer your question directly:

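I do it by examining each 'word', where a word is a run of characters that are either word characters (`\w`) or '<' or '>'. The regex passes each run to some_only, which looks for either of the latter two characters and prints the run only if neither appears. Because a word that touches a tag ends up in the same run as the bracket (for example `xml>hello`), it is skipped along with the tag.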

>>> import re
>>> strTest = "<random xml>hello this was successful price<random xml>"
>>> def some_only(matchobj):
...     if '<' in matchobj.group() or '>' in matchobj.group():
...         pass
...     else:
...         print (matchobj.group())
...         pass
... 
>>> ignore = re.sub(r'[<>\w]+', some_only, strTest)
this
was
successful

This works for your test string; however, as others have already mentioned, using a regex on xml will usually lead to many woes.

To use a more conventional approach, I had to fix a couple of errors in that XML string: I changed random xml to random_xml (element names cannot contain spaces) and used a proper closing tag.

I prefer to use the lxml library.

>>> strTest = "<random_xml>hello this was successful price</random_xml>"
>>> from lxml import etree
>>> tree = etree.fromstring(strTest)
>>> tree.text
'hello this was successful price'
>>> tree.text.split(' ')
['hello', 'this', 'was', 'successful', 'price']
>>> tree.text.split(' ')[1:-1]
['this', 'was', 'successful']
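For a standard-library-only variant, the same idea can be sketched with `xml.etree.ElementTree`, extended to nested elements by reading each element's `.text` and `.tail`. The helper name `words_not_next_to_tags` is hypothetical, and the sketch assumes that the first and last whitespace-separated word of every text chunk counts as adjacent to a tag, even when whitespace separates them.

```python
import xml.etree.ElementTree as ET


def words_not_next_to_tags(xml_string):
    """Collect words that are neither first nor last in any text chunk.

    Hypothetical helper for illustration. Every .text or .tail chunk sits
    between two tags, so its first and last words are the ones touching a tag.
    """
    root = ET.fromstring(xml_string)
    kept = []
    for elem in root.iter():                  # root and all descendants
        for chunk in (elem.text, elem.tail):  # text inside / after the element
            if chunk:
                kept.extend(chunk.split()[1:-1])
    return kept


print(words_not_next_to_tags(
    "<random_xml>hello this was successful price</random_xml>"))
```

The code avoids version-specific syntax, so it should run on Python 2.7 as well as Python 3. It requires well-formed XML, so the original test string still needs the corrections described above.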
Bill Bell
  • I really like this solution, but I only want to use the stdlib. How could this be done using xml.etree.ElementTree? BTW, I am running Python 2.7. – Bman425 Jul 26 '17 at 15:55
  • 1
    @Bman425, it's basically identical. `import xml.etree.ElementTree as ET; tree = ET.fromstring(strTest); print tree.text.split(' ')[1:-1]` – Charles Duffy Jul 26 '17 at 16:00
  • BTW, there's probably some work that could be done here to improve this answer's applicability: descending the tree looking for elements and incorporating `.tail` as well as `.text`, for example; the OP's sample input is clearly inadequate to their actual intent. – Charles Duffy Jul 26 '17 at 16:05
  • Agreed. My concern would be that this might easily go beyond the OP's skills level. As it is, simple question, simple answer. – Bill Bell Jul 26 '17 at 16:08
0

I'll give it a shot. Since we are already doing more than just a regex, put it into a list and drop the first/last items:

import re

strTest = "<random xml>hello this was successful price<random xml>"

thelist = []

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          thelist.append(c1)

thelist = thelist[1:-1]

print (thelist)

result:

['this', 'was', 'successful']

I would personally parse the XML instead, but since you already have this code, this slight modification could do the trick.

sniperd
  • This works well for the example I put, but I am worried it will not scale well. I agree that I should try using an XML parser. – Bman425 Jul 26 '17 at 15:59
0

A simple way to do it with a list, assuming that a word followed or preceded by an XML tag is never separated from that tag by a space:

test = "<random xml>hello this was successful price<random xml>"

test = test.split()

new_test = []
for val in test:
    if "<" not in val and ">" not in val:
        new_test.append(val)

print(new_test)

The result will be:

['this', 'was', 'successful']
0

My solution...

I don't see the need to use regex at all, you could solve it in a one-line list comprehension:

test = "<random xml>hello this was successful price<random xml>"
words = [w for w in test.split() if "<" not in w and ">" not in w]
Community
Joe Iddon