Regular expression for selecting text inside

Question

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

What would be the best regular expression to extract only the text from tags? For example, given this kind of HTML markup:

<tr class="classo">
<td>text1</td>
<td class="dot">text2 </td>
<td>text3</td>
<td class="dot"> text4</td>
<td class="dot">text4</td>
</tr>

The number of td tags is not fixed, and some of them will have a class attribute, but I'm only interested in getting the text from inside the td tags.

What do you mean by extract; do you want the text between the td tags to be stored as a javascript variable? Or do you want to change the text within the tags? — Devon Bernard, Jan 13 '13 at 00:23
To parse HTML, it's generally better to use an established library (such as [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) ) than putting together custom regular expressions. — Yavar, Jan 13 '13 at 00:26

score 2 · Answer 1 · answered Jan 13 '13 at 00:25

Instead of spending time with regular expressions, use something designed for the task. I like BeautifulSoup:

>>> s = """
... <tr class="classo">
... <td>text1</td>
... <td class="dot">text2 </td>
... <td>text3</td>
... <td class="dot"> text4</td>
... <td class="dot">text4</td>
... </tr>
... """
>>> 
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find_all("td")
[<td>text1</td>, <td class="dot">text2 </td>, <td>text3</td>, <td class="dot"> text4</td>, <td class="dot">text4</td>]
>>> [tag.text for tag in soup.find_all("td")]
[u'text1', u'text2 ', u'text3', u' text4', u'text4']

DSM


I actually have the table row selected in BeautifulSoup, but I want to extract the content row by row so I can place it in a dictionary and have something like this: rowDict = {'row1': ['text1','text2', 'text3'], 'row2': ['text1', 'text2','text3...']} – Zed Jan 13 '13 at 00:48
A) if you had additional requirements, edit the original question to include them. B) if your keys really are nothing more than "row1", "row2", etc., then just use a list of lists and get each row's data as `rowData[0]`, `rowData[1]` and so on. – PaulMcG Jan 13 '13 at 01:12

brunsgaard · Answer 2 · 2013-01-13T00:34:48.643

The regex <td.*?>(.*?)<\/td> will probably do.
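
As a rough sketch of how that pattern could be applied in Python with the standard `re` module (the names `markup` and `cells` are illustrative, not from the question):

```python
import re

# Sample markup from the question.
markup = """
<tr class="classo">
<td>text1</td>
<td class="dot">text2 </td>
<td>text3</td>
<td class="dot"> text4</td>
<td class="dot">text4</td>
</tr>
"""

# Non-greedy match: skip the tag's attributes, then capture up to the next </td>.
pattern = re.compile(r"<td.*?>(.*?)</td>")

cells = pattern.findall(markup)
print(cells)  # ['text1', 'text2 ', 'text3', ' text4', 'text4']

# Strip the stray whitespace if only the bare text is wanted.
stripped = [cell.strip() for cell in cells]
print(stripped)  # ['text1', 'text2', 'text3', 'text4', 'text4']
```

Note that without `re.DOTALL` the `.` does not match newlines, so a cell whose text spans several lines would be missed.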

But I would recommend using the HTMLParser module or BeautifulSoup instead.

I took the time to write another example using HTMLParser:

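try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class TDExtractor(HTMLParser):
    """Collects the text found inside <td> elements."""

    def handle_starttag(self, tag, attrs):
        # Start recording when a <td> opens.
        if tag == 'td':
            self.recording = True

    def handle_endtag(self, tag):
        # Stop recording when the </td> closes.
        if tag == 'td':
            self.recording = False

    def handle_data(self, data):
        # Keep text only while inside a <td>.
        if self.recording:
            self.data.append(data)

    def reset(self):
        # Called by HTMLParser.__init__, so this state is always initialised.
        HTMLParser.reset(self)
        self.data = []
        self.recording = False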

And in action:

> tdextractor = TDExtractor()
> tdextractor.feed(some_htmldata)
> print(tdextractor.data) # will print a list with all the td data.
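
The same kind of parser could be extended to keep cells grouped by row, along the lines of the list of lists suggested in the comments on the BeautifulSoup answer. The following is only a sketch: the class name `RowExtractor`, the two-row `sample` string, and the Python 3 import path are assumptions made for illustration.

```python
from html.parser import HTMLParser  # Python 3 location of the module


class RowExtractor(HTMLParser):
    """Hypothetical helper: collect <td> text grouped by enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: open a fresh list for its cells.
            self.rows.append([])
        elif tag == "td" and self.rows:
            # A new cell starts: add an empty string to fill in.
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_cell:
            self.rows[-1][-1] += data


# Made-up two-row input in the style of the question's markup.
sample = """
<tr class="classo"><td>text1</td><td class="dot">text2 </td></tr>
<tr class="classo"><td>text3</td><td class="dot"> text4</td></tr>
"""

parser = RowExtractor()
parser.feed(sample)
parser.close()

rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)  # [['text1', 'text2'], ['text3', 'text4']]
```

Each row is then available by index (`rows[0]`, `rows[1]`, and so on), which avoids inventing keys such as 'row1' and 'row2'.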

score 1 · Answer 3 · answered Jan 13 '13 at 00:33

Regular expressions were not designed to parse HTML. HTML is not a regular language and cannot be parsed very easily with regular expressions.
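
A small illustration of why this matters, using a typical non-greedy pattern and two made-up snippets (neither is from the question): an attribute value containing `>` and a table nested inside a cell both produce wrong captures.

```python
import re

pattern = re.compile(r"<td.*?>(.*?)</td>")

# Hypothetical input 1: an attribute value that contains '>'.
tricky_attr = '<td title="a>b">text</td>'
attr_result = pattern.findall(tricky_attr)
print(attr_result)  # ['b">text']  (part of the attribute leaks into the capture)

# Hypothetical input 2: a table nested inside a cell.
nested = "<td><table><tr><td>inner</td></tr></table>outer</td>"
nested_result = pattern.findall(nested)
print(nested_result)  # ['<table><tr><td>inner']  (stops at the first </td>)
```

A real HTML parser tracks quoting and nesting, so it does not trip over either case.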

A lot of people like BeautifulSoup, but it is pretty slow and not as good as lxml, which can even use BeautifulSoup as a parser when needed.

Here's a solution using lxml.

>>> import lxml.html
>>> html = lxml.html.fromstring("""
... <tr class="classo">
... <td>text1</td>
... <td class="dot">text2 </td>
... <td>text3</td>
... <td class="dot"> text4</td>
... <td class="dot">text4</td>
... </tr>""")
>>> print([e.text for e in html.xpath("td")])
['text1', 'text2 ', 'text3', ' text4', 'text4']
