How to create a regex for the following scenario (HTML)?

Question

I have a few known formats in an HTML page, and I need to parse the content of the tags:

<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD> 
</TR>
<TR>
    <TD align=center> </TD>
</TR>

Basically, I thought I could concatenate the HTML with a regular expression that matches anything inside the spot I'm looking for.

I know that the text before and after VALUES_TO_FIND will always be the same. How can I find it using a regular expression? (I'm dealing with several cases, and the format can repeat in several places in the page.)

[You can't parse XHTML with regex](http://stackoverflow.com/a/1732454/159319) — jantimon, Jul 02 '12 at 11:24
In general you could look at `re.findall()`, but I don't think that reg exp will work in your case - there is no unique prefix/suffix in the provided data sample. How will you tell "Reissue of:" from " VALUES_TO_FIND " - both have the same prefix and suffix. — mhawke, Jul 02 '12 at 11:32
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — JJJ, Jul 02 '12 at 11:36
'Reissue of:' + REGEX + ' ' (I'm using regex because I have a format which gives me problems when using lxml/BeautifulSoup) — YSY, Jul 02 '12 at 11:39
You should not parse HTML with regex. It's not a reliable solution. This thread can tell you more about it. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags @MartijnPieters I missed the Python tag, so changed my answer into comment — Robert, Jul 02 '12 at 11:42
@Juhana: The OP has yet to run into the problem in the linked question; this is not a dupe of that post. It probably is a dupe of many other questions here on SO, though. — Martijn Pieters, Jul 02 '12 at 11:42
@YSY: Then ask questions about those problems, perhaps we can help. — Martijn Pieters, Jul 02 '12 at 11:48

score 1 · Accepted Answer · answered Jul 02 '12 at 15:09

This is what you are looking for:

import re

s="""
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD> 
</TR>
"""

p="""
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center>(.*)</TD>
    <TD> </TD> 
</TR>
"""

m = re.search(p, s)
if m:  # re.search returns None when the pattern is not found
    print(m.group(1))
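The literal pattern above only matches when the page's whitespace is identical to the pattern's. A minimal sketch of a more tolerant variant follows, assuming the label cell and the value cell are adjacent `<TD>` elements and the value cell contains no nested `<TD>`. The helper name `find_after_label` and the second sample row are hypothetical, added only for illustration.

```python
import re

def find_after_label(html, label):
    """Return the text of every cell that directly follows a cell holding `label`."""
    pattern = re.compile(
        r"<TD[^>]*>\s*" + re.escape(label) + r"\s*</TD>"  # the known label cell
        + r"\s*<TD[^>]*>(.*?)</TD>",                      # the cell to capture
        re.IGNORECASE | re.DOTALL,
    )
    # findall returns one captured string per occurrence in the page
    return [value.strip() for value in pattern.findall(html)]

# Sample page; the second row is made up to show a repeated format.
sample = """
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD>
</TR>
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> second value </TD>
</TR>
"""

print(find_after_label(sample, "Reissue of:"))
```

`re.escape` keeps characters in the label from being read as regex syntax, and `\s*` absorbs indentation and line breaks between the tags. This is still a regex over markup, so it will miss cells whose attributes or nesting differ from the assumed shape.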

score 0 · Answer 2 · answered Jul 02 '12 at 11:56

Don't use regular expressions to parse HTML (it's not a regular language). There are many threads on the topic on Stack Overflow.

I recommend using BeautifulSoup, Pattern, or similar modules.

marbdq

These days, regular expressions aren't regular either: [The true power of regular expressions](http://nikic.github.com/2012/06/15/The-true-power-of-regular-expressions.html) – Ned Batchelder Jul 02 '12 at 12:37
@NedBatchelder true, but that itself won't make them more readable than a CFG or a PEG. :-) – Kos Jul 02 '12 at 13:59
That's an awesome link, by the way. I believe the time has come to invent the term "irregular expression", or "iregex". :D – Kos Jul 02 '12 at 14:04
Don't get me wrong, I think you should generally use other things for HTML anyway, but "HTML is not regular" is a silly mantra. – Ned Batchelder Jul 02 '12 at 14:04
Nay, what's silly here is calling these inventions "regular expressions". That's like inventing rational numbers and calling them "the new improved integers", then insisting 1.5 is an integer too. – Kos Jul 02 '12 at 14:10

Marco de Wit · Answer 3 · 2012-07-02T14:31:03.800

This regular expression will do:

re.findall(r'<TR>\s+<TD.+?</TD>\s+<TD align=center>(.*?)</TD>',html,re.DOTALL)

But I recommend using a parser.
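One caveat with lazy `.+?` under `re.DOTALL`: because `.` then matches newlines, the match may stretch across cell and row boundaries when a row has fewer cells than expected. A hedged sketch of a stricter variant follows, assuming every cell of interest holds plain text with no nested tags. It captures both the label and the value so that several cases can be handled in one pass. The name `rows_by_label` and the `Replaces:` row are hypothetical.

```python
import re

# Assumption: the label is in a row's first cell, the value in its second,
# and neither cell contains nested markup. [^<]* cannot cross a tag boundary.
ROW = re.compile(
    r"<TR>\s*<TD[^>]*>([^<]*)</TD>\s*<TD[^>]*>([^<]*)</TD>",
    re.IGNORECASE,
)

def rows_by_label(html):
    """Map each first-cell label to the list of second-cell values found under it."""
    found = {}
    for label, value in ROW.findall(html):
        found.setdefault(label.strip(), []).append(value.strip())
    return found

# Sample page; the "Replaces:" row is invented for illustration.
page = """
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD>
</TR>
<TR>
    <TD align=center> </TD>
</TR>
<TR>
    <TD align=center>Replaces:</TD>
    <TD align=center> another value </TD>
</TR>
"""

print(rows_by_label(page).get("Reissue of:"))
```

The single-cell row in the middle is skipped rather than merged with the row after it, which is the behaviour the negated character class is there to guarantee.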

edited Jul 02 '12 at 14:31

answered Jul 02 '12 at 11:58

score 0 · Answer 4 · answered Jul 02 '12 at 12:39

There are many better options for getting data out of HTML than regular expressions. Try Scrapy, for example.

Ned Batchelder

score 0 · Answer 5 · answered Jul 02 '12 at 13:49

HTML isn't a regular language, so working with it using regular expressions is difficult.

BeautifulSoup is a nice parser. Here's an example of how to use it:

from BeautifulSoup import BeautifulSoup

html = u'''
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD> 
</TR>
<TR>
    <TD align=center> </TD>
</TR>'''

bs = BeautifulSoup(html)

print [td.contents for td in bs.findAll('td')]

output:

[[u'Reissue of:'], [u' **VALUES_TO_FIND** '], [u' '], [u' ']]

You know what to do from here. :)
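For completeness, the remaining step might look like the sketch below: collect the cell texts in order, then take the cell that follows the known label. To keep it runnable without third-party packages, it swaps in Python 3's standard-library `html.parser` in place of BeautifulSoup. The names `CellCollector` and `values_after` are hypothetical.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":  # html.parser lowercases tag names
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:  # ignore whitespace between cells and rows
            self.cells[-1] += data

def values_after(html, label):
    """Return the text of each cell that immediately follows a cell equal to `label`."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = [cell.strip() for cell in parser.cells]
    return [cells[i + 1] for i, cell in enumerate(cells[:-1]) if cell == label]

html = """
<TR>
    <TD align=center>Reissue of:</TD>
    <TD align=center> **VALUES_TO_FIND** </TD>
    <TD> </TD>
</TR>
<TR>
    <TD align=center> </TD>
</TR>"""

print(values_after(html, "Reissue of:"))
```

This sketch ignores row boundaries, so a label sitting in the last cell of a row would pick up the first cell of the next row. Tracking `<tr>` tags as well would close that gap.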

Install with `pip install BeautifulSoup`. Here are the docs:

http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
