I'm using regex in Python to extract data from HTML. The regex that I've written looks like this:

result = re.findall(r'<td align="left"  csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+|<td align="lef(.*?)" >(.*?)</td>\s+', webpage)

I am assuming that each td follows one of these two formats:

<td align="left"  csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+

OR

<td align="lef(.*?)" >(.*?)</td>

This is because the td can take different formats in that particular cell (it either has data with a link or has no data at all).

I assume that the OR condition I've used is incorrect. I believe the OR is matching only the regex immediately preceding it and the regex immediately following it, not the two entire td patterns.

My question is: how do I group it (for example, with parentheses) so that the OR applies between the two entire td patterns?

user1644208
  • Please, don't parse HTML with regex. Take a look at [this](http://stackoverflow.com/a/1732454/1248554)! – BrtH Sep 10 '12 at 15:07
  • I understand the limitations of regex. I was wondering how the OR can be applied in general and in situations like this :) – user1644208 Sep 10 '12 at 15:13

2 Answers


You are using a regular expression, but matching HTML or XML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from.

ElementTree example:

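from xml.etree import ElementTree

# ElementTree is an XML parser, so the file must be well-formed markup.
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('tr') only sees direct children of the root.
for elem in tree.iter('tr'):
    # encoding='unicode' returns str rather than bytes on Python 3.
    print(ElementTree.tostring(elem, encoding='unicode'))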
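
ElementTree expects well-formed XML, which real-world HTML often is not. As a rough sketch of the parser-based approach, the standard library's `html.parser` can collect the text of each cell without any regex. The `CellCollector` class and the sample markup below are made up for illustration and are not taken from the page in the question.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up sample markup: one cell with a link, one empty cell.
sample = (
    '<table><tr>'
    '<td align="left" csk="20120910"><a href="/x">Some Name</a></td>'
    '<td align="left" ></td>'
    '</tr></table>'
)

collector = CellCollector()
collector.feed(sample)
collector.close()
print(collector.rows)
```

Both cell variants (with a link, or empty) end up as plain strings, so no alternation is needed at all.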
Martijn Pieters
  • As long as there is no nesting, it should theoretically work... That said, regex is a poor tool choice for parsing XML/HTML. – Joran Beasley Sep 10 '12 at 15:50

In <td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+ the .?* should be replaced with .*?.
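
A small check of why the quantifier order matters, using a made-up sample string: Python's `re` module refuses to compile `.?*` (it reports a "multiple repeat" error), while `.*?` is a lazy "any characters" that stops at the first place the rest of the pattern can match.

```python
import re

# "?" immediately followed by "*" is rejected by Python's re module.
try:
    re.compile(r'<a href=.?*>')
    compiled = True
except re.error:
    compiled = False
print(compiled)

# ".*?" is lazy: it consumes as little as possible before the next ">".
sample = '<a href="/x">Some Name</a>'  # made-up sample
match = re.search(r'<a href=.*?>(.*?)</a>', sample)
print(match.group(1))
```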

And, to answer your question, you can use non-capturing grouping to do what you want as follows:

(?:first_regex)|(?:second_regex)
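
As a minimal sketch of that grouping, applied to the two td patterns from the question (with `.?*` already corrected to `.*?`): the sample `webpage` string is invented for illustration. Note that in Python's `re` syntax `|` has the lowest precedence, so a top-level alternation already separates the two entire patterns; the non-capturing groups make that explicit and become necessary once the alternation is embedded in a larger pattern.

```python
import re

# Made-up sample: one cell with a link, followed by one empty cell.
webpage = (
    '<td align="left"  csk="20120910"><a href="/x">Some Name</a></td>\n'
    '<td align="left" ></td>\n'
)

linked = r'<td align="left"  csk="(\d\d\d\d)\d\d\d\d"><a href=.*?>(.*?)</a></td>\s+'
plain = r'<td align="lef(.*?)" >(.*?)</td>\s+'

# Explicit grouping of each whole alternative.
grouped = re.findall('(?:' + linked + ')|(?:' + plain + ')', webpage)
# The same alternation without the extra groups.
ungrouped = re.findall(linked + '|' + plain, webpage)

print(grouped)
print(grouped == ungrouped)
```

`findall` returns one tuple per match with a slot for every capturing group in the pattern, and groups from the alternative that did not match come back as empty strings.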

By the way, you can also replace \d\d\d\d with \d{4}, which I think is easier to read.
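
A quick check on a made-up attribute string suggests the two spellings behave the same:

```python
import re

sample = 'csk="20120910"'  # made-up sample value

long_form = re.findall(r'csk="(\d\d\d\d)\d\d\d\d"', sample)
short_form = re.findall(r'csk="(\d{4})\d{4}"', sample)

print(long_form, short_form)
```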

Max