0

I want to write a regular expression that matches either:

<td class="prodSpecAtribute" rowspan="2">[words]</td>

or

<td class="prodSpecAtribute">[words]</td>

For the second case I have:

find2 = re.compile('<td class="prodSpecAtribute">(.*)</td>')

But how can I create a regex that matches either of the two forms?

Josh

4 Answers

4

Don't use regular expressions for this; use an HTML parser like BeautifulSoup. For example:

>>> from bs4 import BeautifulSoup
>>> soup1 = BeautifulSoup('<td class="prodSpecAtribute" rowspan="2">[words]</td>')
>>> soup1.find('td', class_='prodSpecAtribute').contents[0]
u'[words]'
>>> soup2 = BeautifulSoup('<td class="prodSpecAtribute">[words]</td>')
>>> soup2.find('td', class_='prodSpecAtribute').contents[0]
u'[words]'

Or to find all matches:

# page holds the HTML source as a string
soup = BeautifulSoup(page)
for td in soup.find_all('td', class_='prodSpecAtribute'):
    print(td.contents[0])

With BeautifulSoup 3:

# page holds the HTML source as a string
soup = BeautifulSoup(page)
for td in soup.findAll('td', {'class': 'prodSpecAtribute'}):
    print(td.contents[0])
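
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is only an illustration, not part of the answer above: the CellTextParser class and the sample page string are made up for the example, and it does not try to handle nested tables.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> whose class list contains a given name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.cells = []        # finished cell texts
        self._buffer = None    # text chunks while inside a matching <td>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; other attributes such as
        # rowspan are ignored, so both forms of the tag match.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.class_name in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.cells.append("".join(self._buffer))
            self._buffer = None


# Hypothetical sample input covering both forms from the question.
page = (
    '<tr><td class="prodSpecAtribute" rowspan="2">Weight</td>'
    '<td class="prodSpecAtribute">Color</td>'
    '<td class="other">ignored</td></tr>'
)

parser = CellTextParser("prodSpecAtribute")
parser.feed(page)
parser.close()
print(parser.cells)
```

Because the match is made on the parsed class attribute rather than on the raw tag text, attribute order and extra attributes do not matter.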
Andrew Clark
  • I might use `find_all` here instead to handle the multiple-tag case. – DSM May 21 '13 at 19:31
  • @DSM could you please elaborate, I didn't understand your point. Thanks – Josh May 21 '13 at 19:32
  • Would this be correct: 'soup = BeautifulSoup(page) info = soup.findAll('td', class_= prodSpecAtribute)' – Josh May 21 '13 at 19:40
  • @F.J I get an error: for elements in soup.findall('td', class_= 'prodSpecAtribtue'): TypeError: 'NoneType' object is not callable – Josh May 21 '13 at 19:51
  • This indicates that `soup` is `None`, did you import BeautifulSoup and use `soup = BeautifulSoup(page)` before this? – Andrew Clark May 21 '13 at 19:53
  • See my edit, the issue you were seeing was due to a difference between versions. – Andrew Clark May 21 '13 at 20:25
  • @F.J So, this is what I'm doing now: soup = BeautifulSoup(page), info = soup.findAll('td', {'class' : 'prodSpecAtribtue'}), print info, do you know why the output is []? – Josh May 21 '13 at 20:37
  • I find the machinery of Beautiful Soup unjustified in your case. Use the solution of Zsolt Botykai in which ``(.*)`` must however be changed to ``(.*?)`` – eyquem May 21 '13 at 20:37
3

If you want a regex:

find2 = re.compile('<td class="prodSpecAtribute"( rowspan="2")?>(.*)</td>')

But I would use BeautifulSoup.
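
One thing to watch with the pattern above: the optional rowspan part is itself a capturing group, so the cell text ends up in group 2, and findall returns tuples. A possible variant, sketched here with a made-up sample string, uses a non-capturing group and a non-greedy (.*?) so that findall returns just the cell text:

```python
import re

# (?:...)? makes the rowspan attribute optional without creating a group,
# and (.*?) stops at the first </td> instead of the last one on the line.
cell_re = re.compile(r'<td class="prodSpecAtribute"(?: rowspan="2")?>(.*?)</td>')

# Hypothetical sample input with both forms on one line.
page = (
    '<td class="prodSpecAtribute" rowspan="2">Weight</td>'
    '<td class="prodSpecAtribute">Color</td>'
)

matches = cell_re.findall(page)
print(matches)
```

This still only accepts rowspan="2" written exactly that way; any other attribute or spacing will not match.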

guettli
  • Great answer. But you might want to show how simple and readable the `BeautifulSoup` one-liner solution is. – abarnert May 21 '13 at 19:30
0
find2 = re.compile('<td class="prodSpecAtribute"[^>]*>(.*)</td>')

This will work, but there are better solutions for HTML parsing.
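
As a comment elsewhere in this thread points out, the greedy (.*) can swallow several cells when they share a line. A small sketch of the difference, using a made-up sample string:

```python
import re

# Hypothetical sample input: two cells on the same line.
page = (
    '<td class="prodSpecAtribute" rowspan="2">Weight</td>'
    '<td class="prodSpecAtribute">Color</td>'
)

# [^>]* skips any extra attributes up to the end of the opening tag.
greedy = re.compile(r'<td class="prodSpecAtribute"[^>]*>(.*)</td>')
lazy = re.compile(r'<td class="prodSpecAtribute"[^>]*>(.*?)</td>')

# Greedy: one match running from the first cell to the last </td>.
greedy_matches = greedy.findall(page)
# Lazy: one match per cell.
lazy_matches = lazy.findall(page)

print(greedy_matches)
print(lazy_matches)
```

Neither pattern matches cell text that spans several lines unless re.DOTALL is passed to re.compile.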

Zsolt Botykai
0

I would recommend neither regex nor BeautifulSoup. The pyquery project (http://pythonhosted.org/pyquery/) is much faster because it uses the lxml.html library; a speed comparison can be found here: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/. In my own experience, BeautifulSoup is really slow.

In your situation it is as easy as this:

>>> from pyquery import PyQuery as pq
>>> page = pq('<td class="prodSpecAtribute">[words]</td>')
>>> page('.prodSpecAtribute').text()
'[words]'

Once again, BeautifulSoup is really slow.

Visgean Skeloru