0

I am trying to extract the text between <th> tags from an HTML table. The following code shows the problem:

import re

searchstr = '<th class="c1">data 1</th><th>data 2</th>'
p = re.compile(r'<th\s+.*?>(.*?)</th>|<th>(.*?)</th>')
for i in p.finditer(searchstr):
    print(i.group(1))

The output produced by the code is

data 1
None

If I change the pattern to <th>(.*?)</th>|<th\s+.*?>(.*?)</th> the output changes to

None
data 2

What is the correct way to capture the group in both cases? I am not using the pattern <th.*?>(.*?)</th> because there may be <thead> tags in the search string.

Ravi K
  • Don't [use regex to parse HTML](http://stackoverflow.com/a/1732454/5827958). – zondo Mar 13 '16 at 14:17
  • Just keep it simple: `re.compile(r'<th\b[^>]*>(.*?)</th>')`. But it's better to follow Alex's answer. – Avinash Raj Mar 13 '16 at 14:19
  • @AvinashRaj I used that regex and it solved my problem (if you post it as an answer, I can mark it as the solution). Alex's solution is much more elegant, and I would have used it, but I didn't want to install a package to use it in just one place. – Ravi K Mar 13 '16 at 14:39

2 Answers

5

Why not use an HTML parser instead? With BeautifulSoup, for example:

>>> from bs4 import BeautifulSoup
>>> str = '<th class="c1">data 1</th><th>data 2</th>'
>>> soup = BeautifulSoup(str, "html.parser")
>>> [th.get_text() for th in soup.find_all("th")]
[u'data 1', u'data 2']

Also note that str is a bad choice for a variable name: it shadows the built-in str type.
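
If installing a package is not an option, a similar result can probably be had with the standard library alone. The following is only a minimal sketch, assuming Python 3 (where the module is `html.parser`) and flat, non-nested <th> cells; the class name `ThTextCollector` and the sample input are made up for illustration.

```python
from html.parser import HTMLParser


class ThTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <th> element."""

    def __init__(self):
        super().__init__()
        self.cells = []       # one string per <th> seen so far
        self._in_th = False   # True while the parser is inside a <th>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <thead> is a different tag name.
        if tag == "th":
            self._in_th = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_th:
            self.cells[-1] += data


collector = ThTextCollector()
collector.feed('<thead><tr><th class="c1">data 1</th><th>data 2</th></tr></thead>')
collector.close()
print(collector.cells)
```

Because the parser compares whole tag names, <thead> is never confused with <th>.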

alecxe
  • I would prefer not to install a package, as there is just one place where I need scraping; most of the work is done through APIs. I corrected the variable name; I had put it in the example while trying to simplify the post. – Ravi K Mar 13 '16 at 14:35
1

You can reduce the regex to a single capturing group, so the text is always in group(1). The \b after th stops the pattern from matching <thead>:

re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')
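
A quick way to check the pattern, as a sketch: the sample string below is an assumption (the question's input wrapped in a <thead> row) and the variable names are only illustrative.

```python
import re

# (?s) lets . match newlines; \b requires a word boundary after "th",
# so "<thead" (where "th" is followed by "e") does not match.
th_pattern = re.compile(r'(?s)<th\b[^>]*>(.*?)</th>')

# Hypothetical input: the question's cells wrapped in <thead><tr>...</tr></thead>.
sample = '<thead><tr><th class="c1">data 1</th><th>data 2</th></tr></thead>'

# With exactly one capturing group, findall returns a list of that group's text.
cells = th_pattern.findall(sample)
print(cells)  # ['data 1', 'data 2']
```

With the original two-branch alternation, each branch has its own group number, which is why group(1) was None whenever the second branch matched.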
Avinash Raj