Simple regex for simple xml string

Question

I have a string consisting of elements. Each element can contain "pear" or "apple". I can get all the elements using:

s = '<tag>uTSqUYRR8gapple</tag><tag>K9VGTZM3h8</tag><tag>pearTYysnMXMUc</tag><tag>udv5NZQdpzpearz5a4oS85mD</tag>'
import re; re.findall("<tag>.*?</tag>", s)

However, I want to get the last element that contains "pear". What would be the easiest/quickest way to do this? Is this a good way:

list = re.findall("<tag>.*?</tag>", s)
list.reverse()
last = next(x for x in list if re.match('.*pear', x))
re.match('<tag>(.*)</tag>', last).group(1)

or should I use a parser instead?
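
For comparison, the regex attempt above could be tightened as in the following sketch. It is only an illustration, and it assumes the tags are never nested and carry no attributes. The names contents and last_pear are made up for this example. A capture group returns the inner text directly, and a plain substring test replaces the second regex.

```python
import re

s = '<tag>uTSqUYRR8gapple</tag><tag>K9VGTZM3h8</tag><tag>pearTYysnMXMUc</tag><tag>udv5NZQdpzpearz5a4oS85mD</tag>'

# Capture only the text between the tags, so no second regex is needed.
contents = re.findall(r"<tag>(.*?)</tag>", s)

# Walk the matches from the end; fall back to None if nothing contains "pear".
last_pear = next((c for c in reversed(contents) if "pear" in c), None)
print(last_pear)  # udv5NZQdpzpearz5a4oS85mD
```

This avoids shadowing the built-in list and does not raise StopIteration when no element matches.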

[don't use regex to parse xml](http://stackoverflow.com/a/1732454/5323213) — R Nar, May 25 '16 at 20:29
@RNar: that was *exactly* the SO answer I was going to link to. Still one of the greatest answers of all time! ***HE COMES!*** — Peter Rowell, May 25 '16 at 20:45
And yet people still test their fates and attempt to join together the forces of regex and XML. Don't do it people, don't encourage the spawning. — R Nar, May 25 '16 at 20:49

score 1 · Accepted Answer · answered May 25 '16 at 20:33

Use a parser instead, e.g. BeautifulSoup:

import re
from bs4 import BeautifulSoup

s = '<tag>uTSqUYRR8gapple</tag><tag>K9VGTZM3h8</tag><tag>pearTYysnMXMUc</tag><tag>udv5NZQdpzpearz5a4oS85mD</tag>'
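# The html5lib parser must be installed separately (pip install html5lib).
soup = BeautifulSoup(s, "html5lib")
# Collect every text node whose content matches the regex.
tags = soup.find_all(text=re.compile(r'pear'))
print(tags)
# ['pearTYysnMXMUc', 'udv5NZQdpzpearz5a4oS85mD']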

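This builds the DOM and finds all text nodes whose content matches the regex pear (a literal search for "pear"). The results come back in document order, so tags[-1] is the text of the last element containing "pear".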
See a demo on ideone.com.

Jan


BeautifulSoup is specialized in dealing with poorly-formatted HTML. Since the input here is (expected to be) well-formed XML (not HTML at all), it's a rather suboptimal choice. – Charles Duffy May 25 '16 at 20:57

score 0 · Answer 2 · answered May 25 '16 at 21:05

Using a proper XML library will let you use XPath to encapsulate your query. For instance (with the elements wrapped in a single root element, which well-formed XML requires):

s = '<root><tag>uTSqUYRR8gapple</tag><tag>K9VGTZM3h8</tag><tag>pearTYysnMXMUc</tag><tag>udv5NZQdpzpearz5a4oS85mD</tag></root>'

import lxml.etree
root = lxml.etree.fromstring(s)
result = root.xpath('//tag[contains(., "pear")][last()]/text()')

...for which result will be, for the input data given, ['udv5NZQdpzpearz5a4oS85mD']. In this case you don't need to search for the last item in your own code; you can rely on the XPath engine (implemented in C, as part of libxml2) to do that for you.
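
If lxml is not installed, roughly the same result can be had from the standard library's xml.etree.ElementTree. The sketch below is an alternative, not part of the answer's method. ElementTree's limited XPath subset has no contains(), so the filtering is done in Python instead. The names matches and last are chosen for illustration.

```python
import xml.etree.ElementTree as ET

s = '<root><tag>uTSqUYRR8gapple</tag><tag>K9VGTZM3h8</tag><tag>pearTYysnMXMUc</tag><tag>udv5NZQdpzpearz5a4oS85mD</tag></root>'

root = ET.fromstring(s)

# iter() yields elements in document order; keep the text of those containing "pear".
matches = [el.text for el in root.iter('tag') if el.text and 'pear' in el.text]

# The last match, or None when no element contains "pear".
last = matches[-1] if matches else None
print(last)  # udv5NZQdpzpearz5a4oS85mD
```

Unlike the XPath version, this walks the whole tree in Python, which should be fine for small inputs like this one.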
