Python Regular Expressions - extract every table cell content

Question

Possible Duplicate:
RegEx match open tags except XHTML self-contained tags

If I have a string that looks something like...

"<tr><td>123</td><td>234</td>...<td>697</td></tr>"

Basically a table row with n cells.

What's the easiest way in Python to get the value of each cell? That is, I just want the values "123", "234", "697" stored in a list, an array, or whatever is easiest.

I've tried to use regular expressions. When I use `re.match`, I am not able to get it to find anything. If I try with `re.search`, I can only get the first cell, but I want to get all the cells. If I can't do this with n cells, how would you do it with a fixed number of cells?

score 5 · Accepted Answer · answered Mar 23 '12 at 02:04

If that markup is part of a larger set of markup, you should prefer a tool with an HTML parser.
One such tool is BeautifulSoup.
Here's one way to find what you need using that tool:

>>> from bs4 import BeautifulSoup as bs
>>> markup = '<tr><td>123</td><td>234</td>...<td>697</td></tr>'
>>> # Name the parser explicitly so the result does not depend on what is installed
>>> soup = bs(markup, 'html.parser')
>>> for i in soup.find_all('td'):
...     print(i.text)

Result:

123
234
697
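
For a short, well-formed fragment like the one in the question, the regex route can also be made to work. This is only a sketch, and it assumes the cells have no attributes and no nested tags. `re.match` only matches at the very start of the string (which here is "<tr>"), and `re.search` stops at the first hit, whereas `re.findall` returns every match as a list:

```python
import re

row = "<tr><td>123</td><td>234</td>...<td>697</td></tr>"

# The non-greedy (.*?) keeps each match inside a single cell.
# With one capture group, findall returns a list of the captured strings.
cells = re.findall(r"<td>(.*?)</td>", row)
print(cells)
```

On anything larger or messier than this fragment, the parser approach above is likely the safer choice.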

answered Mar 23 '12 at 02:04

mechanical_meat


Can you recommend a good tutorial for BeautifulSoup so I can use it to get all the cells, row by row? Thanks – Reily Bourne Mar 23 '12 at 02:07
The documentation is excellent and contains several examples: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – mechanical_meat Mar 23 '12 at 02:08

score 1 · Answer 2 · answered Mar 23 '12 at 02:03

Don't do this. Just use a proper HTML parser, and use something like XPath to get the elements you want.

A lot of people like lxml. For this task, you will probably want to use the BeautifulSoup backend, or use BeautifulSoup directly, because this is presumably not markup from a source known to generate well-formed, valid documents.
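
If installing a third-party library is not an option, the standard library's `html.parser` can do the same job. The following is a minimal sketch, not a general solution: `CellCollector` is a hypothetical helper name, and it assumes the cells do not contain nested tables.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text outside a <td> (such as the "..." in the sample) is ignored.
        if self._in_td:
            self.cells[-1] += data


collector = CellCollector()
collector.feed("<tr><td>123</td><td>234</td>...<td>697</td></tr>")
collector.close()
print(collector.cells)
```

This parser tolerates markup that is not well-formed XML, which is the main reason to prefer it over an XML-only tool for scraped pages.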

answered Mar 23 '12 at 02:03

Marcin


I prefer xml.etree.cElementTree – Vayn Mar 23 '12 at 02:05
@Vayn That's great for known-good markup. – Marcin Mar 23 '12 at 02:06
@Vayn: would you write an answer showing us how to use `xml.etree.cElementTree`? :D – mechanical_meat Mar 23 '12 at 02:06
@bemie I forgot xml.etree.cElementTree is only good for xhtml – Vayn Mar 23 '12 at 02:32

hostingutilities.com · Answer 3 · 2021-11-07T05:53:57.800

When using lxml, an element tree gets created. Each element in the element tree holds information about a tag.

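from lxml import etree

# Parse a small XML document into an element tree
root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
elements = root.findall(".//a")  # every <a> element below the root
tag = elements[0].tag            # the tag name, 'a'
attr = elements[0].attrib        # dict-like mapping of attributes (x -> '123')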
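
Applied to the row from the question, the same `findall` idea might look like the sketch below. It uses the standard library's `xml.etree.ElementTree` rather than lxml (an assumption made so the snippet runs without extra installs; both expose `findall` and `.text` on elements), and it only works if the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# The sample row happens to be well-formed XML, so it parses directly.
row = ET.fromstring("<tr><td>123</td><td>234</td>...<td>697</td></tr>")

# .text holds the text directly inside each <td> element.
cells = [td.text for td in row.findall(".//td")]
print(cells)
```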

edited Nov 07 '21 at 05:53

answered Mar 23 '12 at 04:03

hostingutilities.com

