
I'm trying to use a regex to match a cell in a table, but not all cells follow the same pattern. For example, a td may take this format:

<td><a href="page101010.html">PageNumber</a></td>

or this format:

<td align="left" ></td>

Basically, the hyperlink inside the td is not present in every cell, only in some.

I tried to handle both cases with the Python regex below, but it's failing.

match = re.search(r'<td align="left" ><?a?.+\>?(.+)\<?\/?a?\>?\<\/td\>', tdlink)

I just need 'match' to capture the part enclosed in () above. However, I get either a syntax error or a None object.

Where am I going wrong?

Brian Tompsett - 汤莱恩
user1644208
  • Basically, I'm using the ? to check whether the link is present and act accordingly. – user1644208 Sep 06 '12 at 21:01
  • First place you're going wrong - [using regexp on HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Izkata Sep 06 '12 at 21:02
  • +1 to @Izkata, you can parse HTML in the proper way with Python. – Lev Levitsky Sep 06 '12 at 21:03
  • Do you have to use regex for this, or can you use an HTML parsing library like [lxml](http://lxml.de/) or [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/)? – ecmendenhall Sep 06 '12 at 21:04
  • I have to use regex. I know I'm wrong with the ? part of the line, as the .+ (dot plus) will spoil the preceding '?' and fail there. But I have been thinking about generalising as much as possible... hmm. – user1644208 Sep 06 '12 at 21:13
  • @user1644208: Why do you have to use regex? It is *very much* the wrong tool for the job. – Martijn Pieters Sep 06 '12 at 21:14
  • Yup, I think I would be better off using some string operations on the output, rather than generalizing it. (I have to use regex :(( ) – user1644208 Sep 06 '12 at 21:45
  • @user1644208 Your insistence makes this sound like homework. I think you should tell your teacher that he/she is also wrong, if possible. (Careful; it may hurt your grade depending on how they take it...) – Izkata Sep 06 '12 at 21:48

1 Answer


You are using a regular expression, and matching HTML or XML with such expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from, such as ElementTree, BeautifulSoup, and lxml.

ElementTree example:

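from xml.etree import ElementTree

# ElementTree is an XML parser, so the file must be well-formed markup.
tree = ElementTree.parse('filename.html')
# './/tr' searches the whole tree; a bare 'tr' only matches direct children of the root.
for elem in tree.findall('.//tr'):
    print(ElementTree.tostring(elem, encoding='unicode'))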
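
Applied to the two cell formats from the question, the same approach might look like the sketch below. The `SAMPLE` string and the `cell_texts` helper are made up for illustration, and the sketch assumes the markup is well-formed enough for an XML parser, which real-world HTML often is not.

```python
from xml.etree import ElementTree

# Hypothetical sample built from the two cell formats shown in the question.
SAMPLE = (
    '<table><tr>'
    '<td><a href="page101010.html">PageNumber</a></td>'
    '<td align="left" ></td>'
    '</tr></table>'
)


def cell_texts(markup):
    """Return the text of every <td>, whether or not it wraps a link."""
    root = ElementTree.fromstring(markup)
    # itertext() yields the text of the cell and of any nested elements,
    # so a linked cell and an empty cell are handled the same way.
    return [''.join(td.itertext()).strip() for td in root.iter('td')]


print(cell_texts(SAMPLE))
```

Because the parser tracks nesting, the optional link needs no special handling; an empty cell simply yields an empty string.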
Martijn Pieters
  • +1 BeautifulSoup. `soup = BeautifulSoup('<td><a href="page101010.html">PageNumber</a></td>\n<td align="left" ></td>'); [tag.text for tag in soup.findAll("td")]` returns `[u'PageNumber', '']`, and pretty much everything else is similarly simple. – DSM Sep 06 '12 at 21:08
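
For the case raised in the comments where regex is a hard requirement, here is a minimal sketch of how the optional link could be expressed. In the original pattern, each `?` makes only a single character optional, the literal `align="left" ` prevents the first cell format from matching, and each `.+` demands at least one character, so an empty cell cannot match. Grouping the whole tag in an optional non-capturing group avoids those problems. The `TD_PATTERN` and `cell_text` names are hypothetical, and the pattern assumes one flat cell with at most one link and no other nested tags.

```python
import re

# Assumption: a single <td> with at most one <a> inside and no other nested tags.
TD_PATTERN = re.compile(
    r'<td\b[^>]*>\s*'      # opening td with any attributes
    r'(?:<a\b[^>]*>)?'     # optional opening link tag, grouped as a unit
    r'([^<]*)'             # the cell text (may be empty)
    r'(?:</a>)?'           # optional closing link tag
    r'\s*</td>'
)


def cell_text(tdlink):
    """Return the captured cell text, or None if the string is not a simple td."""
    match = TD_PATTERN.search(tdlink)
    return match.group(1) if match else None


print(cell_text('<td><a href="page101010.html">PageNumber</a></td>'))
print(repr(cell_text('<td align="left" ></td>')))
```

This only covers the two formats shown in the question; anything more varied is where a parser becomes the safer choice.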