-2

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

What would be the best regular expression to extract only the text from tags? For example, given this kind of HTML markup:

<tr class="classo">
<td>text1</td>
<td class="dot">text2 </td>
<td>text3</td>
<td class="dot"> text4</td>
<td class="dot">text4</td>
</tr>

The number of td tags is not fixed, and some of them will have a class attribute, but I'm only interested in getting the text from inside each td tag.

Zed
  • What do you mean by extract; do you want the text between the td tags to be stored as a javascript variable? Or do you want to change the text within the tags? – Devon Bernard Jan 13 '13 at 00:23
  • To parse HTML, it's generally better to use an established library (such as [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)) than to put together custom regular expressions. – Yavar Jan 13 '13 at 00:26

3 Answers

2

Instead of spending time with regular expressions, use something designed for the task. I like BeautifulSoup:

>>> s = """
... <tr class="classo">
... <td>text1</td>
... <td class="dot">text2 </td>
... <td>text3</td>
... <td class="dot"> text4</td>
... <td class="dot">text4</td>
... </tr>
... """
>>> 
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find_all("td")
[<td>text1</td>, <td class="dot">text2 </td>, <td>text3</td>, <td class="dot"> text4</td>, <td class="dot">text4</td>]
>>> [tag.text for tag in soup.find_all("td")]
['text1', 'text2 ', 'text3', ' text4', 'text4']
DSM
  • I actually have the table row selected in BeautifulSoup, but I want to extract the content row by row, so I can place it in a dictionary and have something like this: rowDict = {'row1': ['text1','text2', 'text3'], 'row2': ['text1', 'text2','text3...']} – Zed Jan 13 '13 at 00:48
  • A) if you had additional requirements, edit the original question to include them. B) if your keys really are nothing more than "row1", "row2", etc., then just use a list of lists and get each row's data as `rowData[0]`, `rowData[1]` and so on. – PaulMcG Jan 13 '13 at 01:12
1

The regex `<td.*?>(.*?)<\/td>` will probably do.
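
As a minimal sketch of how that pattern could be applied in Python, assuming the markup from the question is held in a string. The names `markup` and `TD_PATTERN` are illustrative, and the `re.DOTALL` flag and the `strip()` call are additions that are not part of the pattern above:

```python
import re

# Sample markup copied from the question.
markup = """
<tr class="classo">
<td>text1</td>
<td class="dot">text2 </td>
<td>text3</td>
<td class="dot"> text4</td>
<td class="dot">text4</td>
</tr>
"""

# Lazy quantifiers stop each match at the nearest closing </td>.
# re.DOTALL (an assumption here) lets "." span newlines inside a cell.
TD_PATTERN = re.compile(r"<td.*?>(.*?)</td>", re.DOTALL)

# strip() removes the stray spaces around "text2 " and " text4".
cells = [text.strip() for text in TD_PATTERN.findall(markup)]
print(cells)  # ['text1', 'text2', 'text3', 'text4', 'text4']
```

This only holds up for flat, well-formed cells like the ones in the question.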

However, I would recommend using the HTMLParser module or BeautifulSoup instead.

I took the time to write another example using HTMLParser:

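from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class TDExtractor(HTMLParser):
    """Collect the text found inside <td> elements."""

    def handle_starttag(self, tag, attrs):
        # Start recording when a <td> opens.
        if tag == 'td':
            self.recording = True

    def handle_endtag(self, tag):
        # Stop recording when the matching </td> is seen.
        if tag == 'td':
            self.recording = False

    def handle_data(self, data):
        # Keep only text seen while inside a <td>.
        if self.recording:
            self.data.append(data)

    def reset(self):
        # HTMLParser.__init__ calls reset(), so these attributes always exist.
        HTMLParser.reset(self)
        self.data = []
        self.recording = False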

And in action:

>>> tdextractor = TDExtractor()
>>> tdextractor.feed(some_htmldata)  # some_htmldata is a string of HTML
>>> print(tdextractor.data)  # prints a list with all the td data
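
If the cells need to stay grouped by row, as asked in the comments on the first answer, the same approach might be extended along these lines. This is only a sketch: `RowExtractor`, `rows`, `in_cell`, and `sample` are hypothetical names, the second row in `sample` is invented for illustration, and the code assumes Python 3's `html.parser` module.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collect the text of each <td>, grouped by its enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell texts per <tr>
        self.in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: open a fresh list for its cells.
            self.rows.append([])
        elif tag == "td" and self.rows:
            # A new cell starts: add an empty string to fill in.
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Append text to the current cell of the current row.
        if self.in_cell:
            self.rows[-1][-1] += data


# Hypothetical two-row sample built from the question's cells.
sample = (
    '<tr class="classo"><td>text1</td><td class="dot">text2 </td></tr>'
    '<tr><td>text3</td><td class="dot"> text4</td></tr>'
)

parser = RowExtractor()
parser.feed(sample)
parser.close()
print(parser.rows)  # [['text1', 'text2 '], ['text3', ' text4']]
```

The result is a list of lists, so each row is available by index, which is the structure suggested in the comments instead of a dictionary keyed by "row1", "row2", and so on.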
brunsgaard
1

Regular expressions were not designed to parse HTML. HTML is not a regular language, so it cannot be parsed reliably with regular expressions.
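
To illustrate the point, here is a small sketch of how a lazy `<td>` pattern, like the one suggested in another answer, breaks once a cell contains a nested table. The `nested` string is a made-up example, not markup from the question:

```python
import re

# Hypothetical markup: one outer cell that itself contains a table.
nested = "<td>outer <table><tr><td>inner</td></tr></table> tail</td>"

# The lazy match stops at the first </td> it sees,
# which belongs to the inner cell, not the outer one.
matches = re.findall(r"<td.*?>(.*?)</td>", nested)
print(matches)  # ['outer <table><tr><td>inner']
```

The outer cell is cut off at the inner closing tag and the trailing " tail" is lost, because a pattern like this cannot keep track of nesting depth.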

A lot of people like BeautifulSoup, but it is comparatively slow and, in my view, not as good as lxml, which can even use BeautifulSoup as a parser when needed.

Here's a solution using lxml.

>>> import lxml.html
>>> html = lxml.html.fromstring("""
... <tr class="classo">
... <td>text1</td>
... <td class="dot">text2 </td>
... <td>text3</td>
... <td class="dot"> text4</td>
... <td class="dot">text4</td>
... </tr>""")
>>> print([e.text for e in html.xpath("td")])
['text1', 'text2 ', 'text3', ' text4', 'text4']
Fredrick Brennan