
I'm using regex in Python to grab the following data from this line of HTML:

<td xyz="123"><a href="blah.html">This is a line</a></td>

The problem is that in the above td line, the xyz="123" attribute and the <a href> tag are optional, so they do not appear in all the table cells. So I can have tds like this:

<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>

I wrote a regex like this:

<tr><td x?y?z?=?"?(\d\d\d)?"?>?<?a?.*?>?(.*?)?<?/?a?>?</td></tr>

I basically want to capture the "123" data (if present) and the "CaptureThis" data from all tds in each tr.

This regex is not working: it skips the lines without "xyz" data.

I know regex is not the right tool here, but I was wondering if it could be done with regex alone.

Peter O.
user1644208
  • Do not use regex to parse HTML! – hsz Sep 10 '12 at 07:58
  • Just putting a ? after each optional character does not work, as this introduces lots of (unwanted) possibilities. You need to group sets of optional parts. – Veger Sep 10 '12 at 08:00
  • Martijn's answer is correct. Anyway, you shouldn't just put in all those '?'. I'd write something like (untested): `()?(.*?)()?` – Bakuriu Sep 10 '12 at 08:48
  • [Use an XML parser](http://stackoverflow.com/a/1732454/647772). –  Sep 10 '12 at 08:51
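
Veger's point about grouping can be checked directly. The sketch below uses the two sample rows from the question. The literal space after `<td` in the original pattern is mandatory, so a bare `<td>` can never match. Wrapping the whole attribute (space included) in one optional non-capturing group avoids that. The names `original` and `grouped` are made up for illustration.

```python
import re

rows = [
    '<tr><td>New line</td></tr>',
    '<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>',
]

# Pattern from the question: note the literal space right after "<td".
original = re.compile(r'<tr><td x?y?z?=?"?(\d\d\d)?"?>?<?a?.*?>?(.*?)?<?/?a?>?</td></tr>')

# Each optional piece is one group that is present or absent as a unit.
grouped = re.compile(r'<tr><td(?: xyz="(\d{3})")?>(?:<a [^>]*>)?(.*?)(?:</a>)?</td></tr>')

# Does each pattern find anything in each row?
original_hits = [original.search(row) is not None for row in rows]
grouped_hits = [grouped.search(row).groups() for row in rows]
print(original_hits)
print(grouped_hits)
```

With the grouped version, a missing attribute shows up as `None` in the first group instead of making the whole row fail to match.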

3 Answers


You are using a regular expression, and matching XML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from.

ElementTree example:

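from xml.etree import ElementTree

# ElementTree needs well-formed markup (XHTML); 'filename.html' is a placeholder.
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('tr') only sees direct children of the root.
for elem in tree.iter('tr'):
    print(ElementTree.tostring(elem, encoding='unicode'))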
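
For the rows in the question, a parser-based version might look like the sketch below. It uses the standard library's `html.parser` rather than ElementTree, since that module tolerates markup that is not well-formed XML. The class name `CellCollector` and the sample string are illustrative assumptions, not part of the answer above.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect (xyz attribute or None, cell text) for every <td>."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._xyz = None
        self._text = None  # list of text chunks while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            # attrs is a list of (name, value) pairs; xyz may be absent.
            self._xyz = dict(attrs).get('xyz')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>, including inside <a>.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._text is not None:
            self.cells.append((self._xyz, ''.join(self._text).strip()))
            self._text = None


sample = '''<table>
<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>
</table>'''

collector = CellCollector()
collector.feed(sample)
collector.close()
print(collector.cells)
```

The optional `<a>` needs no special handling here: its text arrives through `handle_data` like any other text inside the cell.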
Martijn Pieters

Would you mind parsing the XML file twice? It is much simpler to solve with regex, but unexpected issues might occur since this is not the right way to do it.

'<td\s+\w+="([^"]*)">' to match the parameters in td cells, and '>([\w\s]+)<' to match the "CaptureThis" data.

>>> import re
>>> line1 = '<tr><td>New line</td></tr>'
>>> line2 = '<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>'
>>> pattern2 = re.compile(r'>([\w\s]+)<')
>>> pattern2.search(line1).group(1)
'New line'
>>> pattern2.search(line2).group(1)
'CaptureThis'

>>> pattern = re.compile(r'<td\s+\w+="([^"]*)">')
>>> pattern.search(line2).group(1)
'123'

Not fully tested, though.
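
One caveat with the session above: the attribute pattern finds nothing in `line1`, because that row has no attribute. `search` then returns `None`, and calling `.group(1)` on it would raise `AttributeError`. Here is a small sketch of a helper that guards against this. `extract`, `attr_pattern` and `text_pattern` are hypothetical names; the patterns are the two from this answer.

```python
import re

attr_pattern = re.compile(r'<td\s+\w+="([^"]*)">')
text_pattern = re.compile(r'>([\w\s]+)<')


def extract(line):
    """Return (attribute value or None, cell text or None) for one row."""
    attr = attr_pattern.search(line)
    text = text_pattern.search(line)
    # search() returns None when nothing matches, so check before .group().
    return (attr.group(1) if attr else None,
            text.group(1) if text else None)


line1 = '<tr><td>New line</td></tr>'
line2 = '<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>'
results = [extract(line1), extract(line2)]
print(results)
```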

oyss

The following code searches the whole string and lists all the matches (even if there is more than one).

>>> import re
>>> text = '''<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>
<tr><td xyz="456">CaptureThisAlso</td></tr>
'''

>>> re.findall(r'<tr><td(?: xyz="(\d+)")?>(?:<a href=".*?">)?(.*?)(?:</a>)?</td></tr>', text)
[('', 'New line'), ('123', 'CaptureThis'), ('456', 'CaptureThisAlso')]
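
The same pattern can be made easier to read with named groups and `re.VERBOSE`. This is only a sketch; the group names `xyz` and `text` and the variable `row_pattern` are chosen here for illustration. Note that an unmatched optional group comes back as `None` from `groupdict()`, rather than as `''` from `findall`.

```python
import re

text = '''<tr><td>New line</td></tr>
<tr><td xyz="123"><a href="blah.html">CaptureThis</a></td></tr>
<tr><td xyz="456">CaptureThisAlso</td></tr>
'''

# In VERBOSE mode whitespace is ignored, so literal spaces are escaped.
row_pattern = re.compile(r'''
    <tr><td
    (?:\ xyz="(?P<xyz>\d+)")?     # optional attribute, present or absent as a unit
    >
    (?:<a\ href=".*?">)?          # optional opening anchor
    (?P<text>.*?)                 # cell text
    (?:</a>)?                     # optional closing anchor
    </td></tr>
''', re.VERBOSE)

rows = [m.groupdict() for m in row_pattern.finditer(text)]
print(rows)
```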
SUB0DH