Matching a group with OR condition in pattern

Question

I am trying to extract the text between <th> tags from an HTML table. The following code shows the problem:

import re

searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))

The output produced by the code is

data 1
None

If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to

None
data 2

What is the correct way to capture the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.

Don't [use regex to parse HTML](http://stackoverflow.com/a/1732454/5827958). — zondo, Mar 13 '16 at 14:17
Just keep it simple: `re.compile(r'<th\b[^>]*>(.*?)</th>')`. But it's better to follow Alex's answer. — Avinash Raj, Mar 13 '16 at 14:19
@AvinashRaj I used that regex and it solved my problem (if you post it as an answer, I can mark it as the solution). Alex's solution is much more elegant, and I would have used it, but I didn't want to install a package to use it in just one place. — Ravi K, Mar 13 '16 at 14:39

score 5 · Answer 1 · edited May 23 '17 at 12:23

Why not use an HTML parser instead? BeautifulSoup, for example:

>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']

Also note that str is a bad choice for a variable name: it shadows the built-in str.
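
If installing a package is not an option, the standard library's html.parser module can collect the same text. The following is only a minimal sketch for Python 3, assuming the <th> cells are not nested; ThCollector is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser


class ThCollector(HTMLParser):  # hypothetical helper, not a library class
    """Collect the text content of every <th> element."""

    def __init__(self):
        super().__init__()
        self.in_th = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; "thead" is a different tag and is ignored.
        if tag == "th":
            self.in_th = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self.in_th = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_th:
            self.cells[-1] += data


collector = ThCollector()
collector.feed('<thead><tr><th class="c1">data 1</th><th>data 2</th></tr></thead>')
collector.close()
print(collector.cells)  # ['data 1', 'data 2']
```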

answered Mar 13 '16 at 14:17 by alecxe; edited by Community

I would prefer not to install a package, as there is just one place where I need scraping; most of the work is done through APIs. I corrected the variable name; I had put it in the example while trying to simplify the post. – Ravi K Mar 13 '16 at 14:35

score 1 · Accepted Answer · answered Mar 13 '16 at 14:41

You can reduce the regex to a single capturing group:

re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
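
A short sketch of how this behaves, for illustration only. The sample string is the one from the question wrapped in an assumed <thead> row, and the variable names are made up here. It also shows why the two-branch pattern printed None: each alternative has its own group number, and only the group in the branch that matched is set.

```python
import re

# Sample from the question, wrapped in an assumed <thead> row.
html = '<thead><tr><th class="c1">data 1</th><th>data 2</th></tr></thead>'

# \b requires a word boundary after "th", so "<thead>" is not matched;
# [^>]* consumes any attributes; (?s) lets "." span newlines inside a cell.
th_pattern = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
cells = th_pattern.findall(html)
print(cells)  # ['data 1', 'data 2']

# With the question's alternation, group 1 belongs to the first branch and
# group 2 to the second. m.lastindex is the number of the group that matched.
alt_pattern = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
alt_cells = [m.group(m.lastindex) for m in alt_pattern.finditer(html)]
print(alt_cells)  # ['data 1', 'data 2']
```

The single-group form is easier to read because the caller never has to work out which branch matched.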

answered Mar 13 '16 at 14:41 by Avinash Raj
