
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

If I have a string that looks something like...

"<tr><td>123</td><td>234</td>...<td>697</td></tr>"

Basically a table row with n cells.

What's the easiest way in Python to get the value of each cell? That is, I just want the values "123", "234", "697" stored in a list, an array, or whatever is easiest.

I've tried regular expressions. With re.match I can't get it to find anything, and with re.search I only get the first cell, but I want all of them. If this can't be done for n cells, how would you do it with a fixed number of cells?

Reily Bourne

3 Answers

If that markup is part of a larger document, you should prefer a tool with an HTML parser.
One such tool is BeautifulSoup.
Here's one way to find what you need using that tool:

>>> markup = '''"<tr><td>123</td><td>234</td>...<td>697</td></tr>"'''
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(markup, 'html.parser')  # name the parser explicitly so the result does not depend on what is installed
>>> for i in soup.find_all('td'):
...     print(i.text)

Result:

123
234
697
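
Since the question asks for the values in a list rather than printed, here is a minimal sketch of the same approach that collects them. It assumes the bs4 package is installed and uses a shortened three-cell row as a stand-in for the real markup. The parser name "html.parser" is passed explicitly because other parsers (html5lib in particular) may discard tr/td tags that appear outside a table.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the row in the question (three cells instead of n).
markup = "<tr><td>123</td><td>234</td><td>697</td></tr>"

soup = BeautifulSoup(markup, "html.parser")

# get_text() returns the text inside each <td>; strip() drops stray whitespace.
cells = [td.get_text().strip() for td in soup.find_all("td")]
print(cells)
```

The same list comprehension should work unchanged for any number of cells, since find_all returns every matching tag.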
mechanical_meat

Don't do this with regular expressions. Use a proper HTML parser, and something like XPath to select the elements you want.

A lot of people like lxml. For this task, you will probably want to use the BeautifulSoup backend, or use BeautifulSoup directly, because this is presumably not markup from a source known to generate well-formed, valid documents.
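
If installing a third-party package is not an option, the standard library's html.parser module can do the same job. This is a rough sketch of that alternative, not the lxml/XPath approach named above; the class name CellCollector and the three-cell row are made up for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: gathers the text inside every <td> it sees."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_td:
            self.cells[-1] += data


# Shortened stand-in for the row in the question.
row = "<tr><td>123</td><td>234</td><td>697</td></tr>"

collector = CellCollector()
collector.feed(row)
collector.close()
print(collector.cells)
```

This is more code than the BeautifulSoup version, which is one reason people tend to reach for a library instead.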

Marcin

When using lxml, an element tree gets created. Each element in the element tree holds information about a tag.

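from lxml import etree

# Parse a small XML document into an element tree
root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
# Find every <a> element anywhere below the root
elements = root.findall(".//a")
tag = elements[0].tag      # the tag name, here 'a'
attr = elements[0].attrib  # dict-like attributes, here x='123'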
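
Applied to the row from the question, the same calls might look like the sketch below. It assumes the row is well-formed enough to parse as XML and uses a shortened three-cell stand-in. The fallback import is only there so the snippet also runs where lxml is not installed, since the standard library's ElementTree offers the same XML, findall and text for this case.

```python
try:
    from lxml import etree
except ImportError:
    # Standard-library fallback; the calls used below exist in both.
    import xml.etree.ElementTree as etree

# Shortened stand-in for the row in the question.
row = etree.XML("<tr><td>123</td><td>234</td><td>697</td></tr>")

# findall("td") returns the direct <td> children; .text is each cell's content.
cells = [td.text for td in row.findall("td")]
print(cells)
```

For markup that is not well-formed, etree.XML will raise a parse error, which is where a more forgiving HTML parser becomes the safer choice.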
hostingutilities.com