Regex in python not taking the specified data in td element

Question

I'm using regex in python to grab the following data from HTML in this line:

<td xyz="123"><a href="blah.html">This is a line</a></td>

The problem is that in the above td line, the xyz="123" attribute and the <a href> tag are optional, so they do not appear in all the table cells. So I can have tds like this:

<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>

I wrote regex like this:

<tr><td x?y?z?=?"?(\d\d\d)?"?>?<?a?.*?>?(.*?)?<?/?a?>?</td></tr>

I basically want to capture the "123" data (if present) and the "CaptureThis" data from all tds in each tr.

This regex is not working, and is skipping the lines without "xyz" data.

I know using regex is not the apt solution here, but was wondering if it could be done with regex alone.

Just putting a ? after each optional character does not work, as this introduces lots of (unwanted) possibilities. You need to group sets of optional parts. — Veger, Sep 10 '12 at 08:00
Martijn's answer is correct; anyway, you shouldn't just put in all those '?'. I'd write something like (untested): `()?(.*?)()?` — Bakuriu, Sep 10 '12 at 08:48
[Use an XML parser](http://stackoverflow.com/a/1732454/647772). — , Sep 10 '12 at 08:51

score 2 · Answer 1 · answered Sep 10 '12 at 07:59

You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from:

ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.

ElementTree example:

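from xml.etree import ElementTree

# ElementTree is an XML parser, so the file must be well-formed (XHTML-style) markup.
tree = ElementTree.parse('filename.html')
# iter() finds <tr> elements at any depth; findall('tr') only sees direct children of the root.
for elem in tree.iter('tr'):
    print(ElementTree.tostring(elem, encoding='unicode'))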
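To get at the actual values the question asks for (the optional xyz attribute and the cell text), something along these lines might work. This is only a sketch: it assumes the markup is well-formed enough for an XML parser, wraps the sample rows in a `<table>` element so there is a single root, and `extract_cells` is a made-up helper name, not part of any library.

```python
from xml.etree import ElementTree

# Sample rows from the question, wrapped in <table> (an assumption) so the
# document has exactly one root element, as an XML parser requires.
html = '''<table>
<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>
</table>'''


def extract_cells(markup):
    """Return a list of (xyz, text) pairs, one per <td>.

    xyz is None when the attribute is missing. The text is collected with
    itertext(), so it is found whether or not it sits inside an <a> tag.
    """
    root = ElementTree.fromstring(markup)
    cells = []
    for td in root.iter('td'):
        xyz = td.get('xyz')  # None if the attribute is absent
        text = ''.join(td.itertext()).strip()
        cells.append((xyz, text))
    return cells


print(extract_cells(html))
# [(None, 'New line'), ('123', 'CaptureThis')]
```

Real-world HTML is often not well-formed XML, in which case a lenient parser such as BeautifulSoup or lxml is probably the safer choice.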

score 0 · Answer 2 · answered Sep 10 '12 at 08:46

Would you mind scanning each line twice? This is much simpler to solve with two small regexes, but unexpected issues might occur, since regex is not the right tool for this.

Use '<td\s+\w+="([^"]*)">' to match the parameters in td cells, and '>([\w\s]+)<' to match the "CaptureThis" data.

>>> import re
>>> line1
'<tr><td>New line</td></tr>'
>>> line2
'<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>'
>>> pattern2 = re.compile(r'>([\w\s]+)<')
>>> pattern2.search(line1).group(1)
'New line'
>>> pattern2.search(line2).group(1)
'CaptureThis'

>>> pattern = re.compile(r'<td\s+\w+="([^"]*)">')
>>> pattern.search(line2).group(1)
'123'

Not fully tested, though.
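One thing to watch for with this two-pattern approach: on a line without the attribute, `pattern.search(...)` returns None, so calling `.group(1)` on it directly raises an AttributeError. A rough sketch of how the two searches could be combined safely is below; `parse_row`, `ATTR_PATTERN` and `TEXT_PATTERN` are names invented here for illustration.

```python
import re

# The same two patterns as in the session above.
ATTR_PATTERN = re.compile(r'<td\s+\w+="([^"]*)">')
TEXT_PATTERN = re.compile(r'>([\w\s]+)<')


def parse_row(line):
    """Return (attribute_value, text) for one row; either part may be None."""
    attr = ATTR_PATTERN.search(line)
    text = TEXT_PATTERN.search(line)
    # search() returns None on no match, so check before calling group().
    return (attr.group(1) if attr else None,
            text.group(1) if text else None)


rows = [
    '<tr><td>New line</td></tr>',
    '<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>',
]
for row in rows:
    print(parse_row(row))
# (None, 'New line')
# ('123', 'CaptureThis')
```

Note that `[\w\s]+` only matches word characters and whitespace, so cell text containing punctuation would not be captured by this pattern.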

score 0 · Answer 3 · answered Sep 10 '12 at 10:18

The following code searches the whole string and lists all the matches (even if there is more than one).

>>> import re
>>> text = '''<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>
<tr><td xyz="456">CaptureThisAlso</td></tr>
'''

>>> re.findall(r'<tr><td(?: xyz="(\d+)")?>(?:<a href=".*?">)?(.*?)(?:</a>)?</td></tr>', text)
[('', 'New line'), ('123', 'CaptureThis'), ('456', 'CaptureThisAlso')]
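For readability, the same pattern could be spelled out with `re.VERBOSE` and named groups, which makes the grouping of the optional parts easier to see. This is just a sketch of that idea; `ROW_PATTERN` and the group names `xyz` and `text` are my own choices. With `finditer`, an optional group that did not take part in the match comes back as None rather than the empty string that `findall` gives.

```python
import re

# Same pattern as above, one piece per line. In VERBOSE mode unescaped
# whitespace is ignored, so literal spaces are written as "\ ".
ROW_PATTERN = re.compile(r'''
    <tr><td
    (?:\ xyz="(?P<xyz>\d+)")?   # optional xyz attribute; digits are captured
    >
    (?:<a\ href=".*?">)?        # optional opening <a> tag, not captured
    (?P<text>.*?)               # cell text, matched as short as possible
    (?:</a>)?                   # optional closing </a> tag
    </td></tr>
''', re.VERBOSE)

text = '''<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>
<tr><td xyz="456">CaptureThisAlso</td></tr>
'''

results = [(m.group('xyz'), m.group('text')) for m in ROW_PATTERN.finditer(text)]
print(results)
# [(None, 'New line'), ('123', 'CaptureThis'), ('456', 'CaptureThisAlso')]
```

Each optional chunk is wrapped in a single `(?:...)?` group, so it is either matched as a whole or skipped as a whole, instead of every character being optional on its own.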
