Regex in Python for html

Question

I want to write a regular expression that matches either:

<td class="prodSpecAtribute" rowspan="2">[words]</td>

or

<td class="prodSpecAtribute">[words]</td>

For the second case I have:

find2 = re.compile('<td class="prodSpecAtribute">(.*)</td>')

But how can I create a single regex that matches either of the two forms?

Are you limited to regex in this situation? Sometimes it's safer to not use regex for HTML parsing... (see [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/), or *[something similar](http://htmlparsing.com/)*...) — summea, May 21 '13 at 19:27
Python has a good [HTML parser](http://docs.python.org/2/library/htmlparser.html) — Mike Samuel, May 21 '13 at 19:27
@MikeSamuel: Well, before 2.7.3 and 3.2.something it's actually kind of slow and finicky… but yeah, still better than trying to solve an HTML parsing problem with regex. — abarnert, May 21 '13 at 19:29
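As a rough illustration of the standard-library parser mentioned in the comments above, here is a minimal sketch. It assumes Python 3, where the module is `html.parser` (the linked docs describe the Python 2 `HTMLParser` module). The class name `CellTextParser` and the sample markup are made up for this example. It matches the `class` attribute exactly and does not try to handle nested tables.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> whose class is 'prodSpecAtribute'."""

    def __init__(self):
        super().__init__()
        self.in_cell = False  # True while inside a matching <td>
        self.cells = []       # one text entry per matching cell

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so rowspan is simply ignored
        if tag == "td" and dict(attrs).get("class") == "prodSpecAtribute":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self.in_cell:
            self.cells[-1] += data


# Hypothetical sample input covering both forms from the question
parser = CellTextParser()
parser.feed('<table><tr>'
            '<td class="prodSpecAtribute" rowspan="2">[words]</td>'
            '<td class="other">skip</td>'
            '<td class="prodSpecAtribute">[more words]</td>'
            '</tr></table>')
parser.close()
print(parser.cells)
```

Because the parser hands over attributes as name/value pairs, the presence or absence of `rowspan` needs no special handling.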

Andrew Clark · Accepted Answer · 2013-05-21T20:24:44.707

4

Don't use regular expressions for this; use an HTML parser like BeautifulSoup. For example:

>>> from bs4 import BeautifulSoup
>>> soup1 = BeautifulSoup('<td class="prodSpecAtribute" rowspan="2">[words]</td>')
>>> soup1.find('td', class_='prodSpecAtribute').contents[0]
u'[words]'
>>> soup2 = BeautifulSoup('<td class="prodSpecAtribute">[words]</td>')
>>> soup2.find('td', class_='prodSpecAtribute').contents[0]
u'[words]'

Or to find all matches:

soup = BeautifulSoup(page)
for td in soup.find_all('td', class_='prodSpecAtribute'):
    print td.contents[0]

With BeautifulSoup 3:

soup = BeautifulSoup(page)
for td in soup.findAll('td', {'class': 'prodSpecAtribute'}):
    print td.contents[0]

edited May 21 '13 at 20:24

answered May 21 '13 at 19:30


I might use `find_all` here instead to handle the multiple-tag case. – DSM May 21 '13 at 19:31
@DSM could you please elaborate, I didn't understand your point. Thanks – Josh May 21 '13 at 19:32
Would this be correct: 'soup = BeautifulSoup(page) info = soup.findAll('td', class_= prodSpecAtribute)' – Josh May 21 '13 at 19:40
@F.J I get an error: for elements in soup.findall('td', class_= 'prodSpecAtribtue'): TypeError: 'NoneType' object is not callable – Josh May 21 '13 at 19:51
This indicates that `soup` is `None`, did you import BeautifulSoup and use `soup = BeautifulSoup(page)` before this? – Andrew Clark May 21 '13 at 19:53
See my edit, the issue you were seeing was due to a difference between versions. – Andrew Clark May 21 '13 at 20:25
@F.J So, this is what I'm doing now: soup = BeautifulSoup(page), info = soup.findAll('td', {'class' : 'prodSpecAtribtue'}), print info, do you know why the output is []? – Josh May 21 '13 at 20:37
I find the machinery of Beautiful Soup unjustified in your case. Use the solution of Zsolt Botykai in which ``(.*)`` must however be changed to ``(.*?)`` – eyquem May 21 '13 at 20:37

score 3 · Answer 2 · answered May 21 '13 at 19:29


If you want a regex:

find2 = re.compile('<td class="prodSpecAtribute"( rowspan="2")?>(.*)</td>')

But I would use BeautifulSoup.
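A minimal sketch of how the optional-attribute approach behaves on both inputs from the question. In the pattern above, `( rowspan="2")?` is itself a capturing group, so the cell text ends up in group 2. The variant below is an adaptation, not the answer's exact pattern: it uses a non-capturing group `(?:...)` so the text is group 1, and a lazy `.*?`. The names `CELL_RE` and `samples` are illustrative.

```python
import re

# Non-capturing group for the optional attribute, so the cell text is group 1
CELL_RE = re.compile(r'<td class="prodSpecAtribute"(?: rowspan="2")?>(.*?)</td>')

# The two forms given in the question
samples = [
    '<td class="prodSpecAtribute" rowspan="2">[words]</td>',
    '<td class="prodSpecAtribute">[words]</td>',
]

texts = [CELL_RE.search(s).group(1) for s in samples]
print(texts)  # ['[words]', '[words]']
```

This only covers `rowspan="2"` written exactly as shown; any other attribute, value, or spacing would not match.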


guettli


Great answer. But you might want to show how simple and readable the `BeautifulSoup` one-liner solution is. – abarnert May 21 '13 at 19:30

score 0 · Answer 3 · answered May 21 '13 at 19:30


find2 = re.compile('<td class="prodSpecAtribute"[^>]*>(.*)</td>')

This will work, but there are better solutions for HTML parsing...


Zsolt Botykai


You must limit the greedy nature of .* – eyquem May 21 '13 at 20:35
@eyquem No I must not. The one who asked a question must. And for the sample data he had provided my solution works. But you are right of course. – Zsolt Botykai May 22 '13 at 13:01
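The greediness point from the comments can be seen once two cells sit on the same line. The sketch below compares `(.*)` with `(.*?)`. The `row` string is made-up sample data, not taken from the asker's page.

```python
import re

# Two matching cells on one line: the case where greedy matching overshoots
row = ('<td class="prodSpecAtribute">first</td>'
       '<td class="prodSpecAtribute" rowspan="2">second</td>')

greedy = re.compile(r'<td class="prodSpecAtribute"[^>]*>(.*)</td>')
lazy = re.compile(r'<td class="prodSpecAtribute"[^>]*>(.*?)</td>')

greedy_hits = greedy.findall(row)  # one hit running through to the last </td>
lazy_hits = lazy.findall(row)      # one hit per cell

print(greedy_hits)
print(lazy_hits)  # ['first', 'second']
```

With a single cell per line, as in the question's sample, both patterns give the same result. Neither matches cell text that spans several lines unless `re.DOTALL` is passed.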

score 0 · Answer 4 · answered May 21 '13 at 21:23

I would recommend neither regex nor BeautifulSoup. The pyquery project (http://pythonhosted.org/pyquery/) is much faster because it uses the lxml.html library; a speed comparison can be found here: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/. In my own experience, BeautifulSoup is really slow.

In your situation it is as easy as this:

>>> from pyquery import PyQuery as pq
>>> page = pq('<td class="prodSpecAtribute">[words]</td>')
>>> page('.prodSpecAtribute').text()
'[words]'

Once again, BeautifulSoup is really slow.
