Matching elements using OR with regex in Python

Question

I'm using regex in Python to extract data from HTML. The regex that I've written is:

result = re.findall(r'<td align="left"  csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+|<td align="lef(.*?)" >(.*?)</td>\s+', webpage)

I'm assuming that this will match a td that follows either of these formats:

<td align="left"  csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+

OR

<td align="lef(.*?)" >(.*?)</td>

This is because the td can take different formats in that particular cell (it either has data with a link, or has no data at all).

I assume the OR condition I've used is incorrect. I believe the OR is being applied only between the regex immediately preceding it and the regex immediately following it, and not between the two entire td patterns.

My question is: how do I group it (for example with parentheses) so that the OR applies between the two entire td patterns?

Please, don't parse html with regex. Take a look at [this](http://stackoverflow.com/a/1732454/1248554)! – BrtH, Sep 10 '12 at 15:07
I understand the limitations of regex. I was wondering how the OR can be applied in general and in situations like this :) – user1644208, Sep 10 '12 at 15:13

score 3 · Answer 1 · answered Sep 10 '12 at 15:07

You are using a regular expression, but matching HTML with such expressions gets too complicated, too fast.

Use an HTML parser instead. Python has several to choose from:

ElementTree is part of the standard library (it is an XML parser, so it needs well-formed markup)
BeautifulSoup is a popular 3rd-party library
lxml is a fast and feature-rich C-based library

ElementTree example:

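from xml.etree import ElementTree

# Assumes 'filename.html' contains well-formed (XHTML-style) markup.
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('tr') would only see direct children of the root.
for elem in tree.iter('tr'):
    print(ElementTree.tostring(elem, encoding='unicode'))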
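
As a rough sketch of how this could apply to the cells in the question: the sample markup below is made up (and deliberately well-formed, since ElementTree is an XML parser), and extract_cells is a hypothetical helper name, not something from the question. It handles both cell formats, the one with a csk attribute and a link, and the empty one, without any alternation.

```python
from xml.etree import ElementTree

# Hypothetical, well-formed sample modelled on the two cell formats in the question.
SAMPLE = """
<table>
  <tr><td align="left" csk="20120910"><a href="/x">Some text</a></td></tr>
  <tr><td align="left"></td></tr>
</table>
""".strip()


def extract_cells(markup):
    """Return one (first four csk digits, link text) pair per td; None where absent."""
    root = ElementTree.fromstring(markup)
    rows = []
    for td in root.iter("td"):
        csk = td.get("csk")   # None if the attribute is missing
        link = td.find("a")   # None if the cell has no link
        rows.append((csk[:4] if csk else None,
                     link.text if link is not None else None))
    return rows


print(extract_cells(SAMPLE))
```

For real-world HTML that is not well-formed, one of the other parsers listed above is likely a better fit.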

answered Sep 10 '12 at 15:07

Martijn Pieters


as long as there is no nesting it should theoretically work... that said regex is a poor tool choice for parsing xml/html – Joran Beasley Sep 10 '12 at 15:50

Max · Answer 2 · 2012-09-10T15:46:12.303

0

In <td align="left" csk="(\d\d\d\d)\d\d\d\d"><a href=.?*>(.*?)</a></td>\s+ the .?* should be replaced with .*?.
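
A quick way to check this, assuming Python's re module: .?* stacks one quantifier directly on another, which re rejects at compile time, while .*? is a lazy "match as little as possible". The helper name compiles is only for illustration.

```python
import re


def compiles(pattern):
    """Return True if re accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r'<a href=.?*>'))  # quantifier applied to a quantifier
print(compiles(r'<a href=.*?>'))  # lazy quantifier
```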

And, to answer your question, you can use non-capturing grouping to do what you want as follows:

(?:first_regex)|(?:second_regex)
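
Here is a minimal sketch of that grouping applied to the two patterns from the question (with .?* already corrected to .*?). The sample html string is made up for illustration. Note that | has the lowest precedence in a regex, so at the top level of a pattern it already separates everything to its left from everything to its right; the (?:...) groups mainly make that explicit, and they become necessary once the alternation sits inside a larger pattern.

```python
import re

# Hypothetical sample cells modelled on the two formats in the question.
html = ('<td align="left"  csk="20120910"><a href="/x">linked</a></td>\n'
        '<td align="left" >plain</td>\n')

first = r'<td align="left"  csk="(\d{4})\d{4}"><a href=.*?>(.*?)</a></td>\s+'
second = r'<td align="lef(.*?)" >(.*?)</td>\s+'

# Each alternative is wrapped in a non-capturing group.
grouped = re.compile('(?:%s)|(?:%s)' % (first, second))
# The same alternation without the extra groups, for comparison.
ungrouped = re.compile(first + '|' + second)

# findall returns one tuple per match with all four capture groups;
# groups belonging to the alternative that did not match are empty strings.
matches = grouped.findall(html)
print(matches)
print(matches == ungrouped.findall(html))
```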

BTW, you can also replace \d\d\d\d with \d{4}, which I think is easier to read.

edited Sep 10 '12 at 15:46

answered Sep 10 '12 at 15:41

Max

