I have a few known formats in an HTML page, and I need to parse the content of the tags:

<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD> 
</TR>
<TR>
    <TD align=center> </TD>
</TR>

Basically, I thought I could combine the surrounding HTML with a regular expression that matches anything in the spot I'm looking for.

I know that the text before and after VALUES_TO_FIND will always be the same. How can I find it using a regular expression? (I'm dealing with several cases, and the format can repeat in several places in the page.)

Robert
YSY
  • [You can't parse XHTML with regex](http://stackoverflow.com/a/1732454/159319) – jantimon Jul 02 '12 at 11:24
  • In general you could look at `re.findall()`, but I don't think that reg exp will work in your case - there is no unique prefix/suffix in the provided data sample. How will you tell "Reissue of:" from " VALUES_TO_FIND " - both have the same prefix and suffix. – mhawke Jul 02 '12 at 11:32
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – JJJ Jul 02 '12 at 11:36
  • 'Reissue of:' + REGEX + ' ' (I'm using regex because I have a format that gives me problems when using lxml/BeautifulSoup.) – YSY Jul 02 '12 at 11:39
  • You should not parse HTML with regex. It's not a reliable solution. This thread can tell you more about it. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags @MartijnPieters I missed the Python tag, so changed my answer into comment – Robert Jul 02 '12 at 11:42
  • @Juhana: The OP has yet to run into the problem in the linked question; this is not a dupe of that post. It probably is a dupe of many other questions here on SO, though. – Martijn Pieters Jul 02 '12 at 11:42
  • @YSY: Then ask questions about those problems, perhaps we can help. – Martijn Pieters Jul 02 '12 at 11:48

5 Answers

1

This is what you are looking for:

import re

s="""
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD> 
</TR>
"""

p="""
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center>(.*)</TD>
    <TD> </TD> 
</TR>
"""

m = re.search(p, s)
print(m.group(1))
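
Because the markup around the value is fixed and the same layout can repeat on the page, one possible variant of the code above builds the pattern from the known label, tolerates whitespace differences between the tags, and collects every occurrence with `re.findall`. This is only a minimal sketch, not the answerer's code: the helper name `find_values` and the sample `page` are made up for illustration, and it assumes the cells always open with `<TD align=center>`.

```python
import re

def find_values(html, label="Reissue of:"):
    """Return the text of the cell that follows each `label` cell (hypothetical helper)."""
    pattern = re.compile(
        r"<TD align=center>\s*" + re.escape(label) + r"\s*</TD>"  # fixed text before the value
        r"\s*<TD align=center>(.*?)</TD>",                         # non-greedy capture of the value
        re.IGNORECASE | re.DOTALL,
    )
    # findall returns one captured group per occurrence on the page.
    return [value.strip() for value in pattern.findall(html)]

# Made-up sample page in which the known format appears twice.
page = """
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> FIRST_VALUE </TD>
    <TD> </TD>
</TR>
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> SECOND_VALUE </TD>
    <TD> </TD>
</TR>
"""

print(find_values(page))
```

The non-greedy `(.*?)` stops at the first closing `</TD>`, so each repeated block yields its own value instead of one match that spans several rows.
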
Yevgen Yampolskiy
0

Don't use regular expressions to parse HTML (it's not a regular language). There are many threads on the topic on Stack Overflow.

I recommend using BeautifulSoup, Pattern, or similar modules.

marbdq
  • These days, regular expressions aren't regular either: [The true power of regular expressions](http://nikic.github.com/2012/06/15/The-true-power-of-regular-expressions.html) – Ned Batchelder Jul 02 '12 at 12:37
  • @NedBatchelder true, but that itself won't make them more readable than a CFG or a PEG. :-) – Kos Jul 02 '12 at 13:59
  • That's an awesome link, by the way. I believe the time has come to invent the term "irregular expression", or "iregex". :D – Kos Jul 02 '12 at 14:04
  • Don't get me wrong, I think you should generally use other things for HTML anyway, but "HTML is not regular" is a silly mantra. – Ned Batchelder Jul 02 '12 at 14:04
  • Nay, what's silly here is calling these inventions "regular expressions". That's like inventing rational numbers and calling them "the new improved integers", then insisting 1.5 is an integer too. – Kos Jul 02 '12 at 14:10
0

This regular expression will do:

re.findall(r'<TR>\s+<TD.+?</TD>\s+<TD align=center>(.*?)</TD>',html,re.DOTALL)

But I recommend using a parser.
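
As one way to follow the parser recommendation without installing anything, the standard library's `html.parser.HTMLParser` can collect the cell texts row by row, after which the value is simply the cell that follows the label. This is a rough sketch under stated assumptions: the names `RowCollector` and `values_after` are hypothetical, the sample markup is made up, and the table is assumed to be flat (no nested tables).

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR>/<TD> arrive as tr/td.
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data

def values_after(html, label="Reissue of:"):
    """Return the stripped text of the cell right after each `label` cell."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [
        row[i + 1].strip()
        for row in parser.rows
        for i, cell in enumerate(row[:-1])
        if cell.strip() == label
    ]

sample = """
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> SOME_VALUE </TD>
    <TD> </TD>
</TR>
<TR>
    <TD align=center> </TD>
</TR>
"""

print(values_after(sample))
```

Matching on the cell text rather than on the surrounding markup means attribute order, letter case, and whitespace in the tags no longer matter.
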

Marco de Wit
0

There are many better options for getting data out of HTML than regular expressions. Try Scrapy, for example.

Ned Batchelder
0

HTML isn't a regular language, so working with it using regular expressions is difficult.

BeautifulSoup is a nice parser; here's an example of how to use it:

from BeautifulSoup import BeautifulSoup

html = u'''
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD> 
</TR>
<TR>
    <TD align=center> </TD>
</TR>'''

bs = BeautifulSoup(html)

print [td.contents for td in bs.findAll('td')]

output:

[[u'Reissue of:'], [u' **VALUES_TO_FIND** '], [u' '], [u' ']]

You know what to do from here. :)

Install with `pip install BeautifulSoup`. Here are the docs:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
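
The import and install command above are for BeautifulSoup 3. The currently maintained package is `beautifulsoup4`, which is imported from `bs4`. A roughly equivalent sketch for that version follows, assuming `pip install beautifulsoup4` and Python 3; the helper name `values_after_label` is made up for illustration and goes one step further than the listing above by picking out the cell that follows the label.

```python
from bs4 import BeautifulSoup

def values_after_label(html, label="Reissue of:"):
    """Return the text of the <td> following each cell whose text equals `label`."""
    soup = BeautifulSoup(html, "html.parser")  # html.parser lowercases tag names
    values = []
    for td in soup.find_all("td"):
        if td.get_text(strip=True) == label:
            following = td.find_next_sibling("td")
            if following is not None:
                values.append(following.get_text(strip=True))
    return values

html = """
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD>
</TR>
<TR>
    <TD align=center> </TD>
</TR>"""

print(values_after_label(html))
```
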

Kos