Python regex issue matching links within td element

Question

I'm trying to use regex to match a cell in a table, but the problem is not all cells follow the same pattern. For example, the td may take this format:

<td><a href="page101010.html">PageNumber</a></td>

or this format:

<td align="left" ></td>

Basically, the hyperlink part within the td is not present in all cells; it's only in some.

I tried matching this situation using the Python regex code below, but it's failing.

match = re.search(r'<td align="left" ><?a?.+\>?(.+)\<?\/?a?\>?\<\/td\>', tdlink)

I just need 'match' to find the part enclosed in () above. However, I'm getting either a syntax error or a None object.

Where am I going wrong?

Basically I'm using the ? to check whether the link is present and act accordingly. — user1644208, Sep 06 '12 at 21:01
First place you're going wrong - [using regexp on HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — Izkata, Sep 06 '12 at 21:02
+1 to @Izkata, you can parse HTML in the proper way with Python. — Lev Levitsky, Sep 06 '12 at 21:03
Do you have to use regex for this, or can you use an HTML parsing library like [lxml](http://lxml.de/) or [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)? — ecmendenhall, Sep 06 '12 at 21:04
I have to use regex. I know I'm wrong with the ? part of the line, as the .+ (dot-plus) will spoil the previous use of '?' and fail there. But I have been thinking about generalising as much as possible... hmm. — user1644208, Sep 06 '12 at 21:13
@user1644208: Why do you have to use regex? It is *very much* the wrong tool for the job. — Martijn Pieters, Sep 06 '12 at 21:14
Yup, I think I would be better off using some string operations on the output, rather than generalizing it. (I really do have to use regex :(( ) — user1644208, Sep 06 '12 at 21:45
@user1644208 Your insistence makes this sound like homework. I think you should tell your teacher that he/she is also wrong, if possible. (Careful; it may hurt your grade depending on how they take it...) — Izkata, Sep 06 '12 at 21:48
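
For readers who, like the asker, are limited to regex: the comment above pinpoints the problem, namely that `.+` swallows whatever the separate `?` quantifiers were meant to make optional. A minimal sketch, assuming each cell holds at most one link and nothing else, is to make the whole `<a ...>...</a>` wrapper a single optional group. The pattern and the `link_text` helper are hypothetical, not the asker's code, and this remains fragile compared with a real parser.

```python
import re

# Hypothetical pattern: the entire <a ...>...</a> wrapper is ONE optional
# non-capturing group, so a cell with no link still matches.
CELL = re.compile(
    r'<td[^>]*>\s*(?:<a[^>]*>(.*?)</a>)?\s*</td>',
    re.IGNORECASE | re.DOTALL,
)

def link_text(tdlink):
    """Return the link text, '' for a cell without a link, None if no cell matches."""
    match = CELL.search(tdlink)
    if match is None:
        return None
    # group(1) is None when the optional link group did not take part in the match
    return match.group(1) or ''

with_link = link_text('<td><a href="page101010.html">PageNumber</a></td>')
without_link = link_text('<td align="left" ></td>')
no_cell = link_text('<p>not a table cell</p>')
```

Cells containing anything other than a single link (plain text, nested tags, several links) will not match this pattern, which is one reason the comments recommend a parser.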

Martijn Pieters · Answer 1 · 2012-09-06T21:10:39.113

6

You are using a regular expression, and matching HTML or XML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from:

ElementTree is part of the standard library.
BeautifulSoup is a popular third-party library.
lxml is a fast and feature-rich C-based library.

ElementTree example:

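from xml.etree import ElementTree

# ElementTree only parses well-formed markup (e.g. XHTML).
tree = ElementTree.parse('filename.html')
# iter() walks the whole tree; findall('tr') would only see direct children of the root.
for elem in tree.iter('tr'):
    print(ElementTree.tostring(elem, encoding='unicode'))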
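
Applied to the two cell shapes from the question, the same approach might look like the sketch below. It assumes the markup is well-formed; the wrapping `<tr>` and the `cell_text` helper are added here for illustration only.

```python
from xml.etree import ElementTree

# The two cells from the question, wrapped in a row so the snippet
# has a single root element (an assumption made for this sketch).
snippet = (
    '<tr>'
    '<td><a href="page101010.html">PageNumber</a></td>'
    '<td align="left" ></td>'
    '</tr>'
)

row = ElementTree.fromstring(snippet)

def cell_text(td):
    # itertext() yields the text of the cell and of every nested element,
    # so a cell without an <a> simply produces an empty string.
    return ''.join(td.itertext()).strip()

texts = [cell_text(td) for td in row.iter('td')]
print(texts)  # ['PageNumber', '']
```

No special case is needed for the missing link: the parser exposes whatever text is there, and an empty cell yields an empty string.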

edited Sep 06 '12 at 21:10

answered Sep 06 '12 at 21:05

3

+1 BeautifulSoup. `soup = BeautifulSoup('<td><a href="page101010.html">PageNumber</a></td>\n<td align="left" ></td>'); [tag.text for tag in soup.findAll("td")]` returns `[u'PageNumber', '']`, and pretty much everything else is similarly simple. – DSM Sep 06 '12 at 21:08
