Replace character in HTML tag with REGEX

Question

Possible Duplicate:
Replace all < and > that are NOT part of an HTML tag

Using Python
I know how much everyone here hates REGEX questions surrounding HTML tags, but I am just doing this as an exercise to help me learn REGEX.

Replace (1 can be any character):

<b>< </b>
<b> < </b>
<b> <</b>
<b><</b>
<b><111</b>
<b>11<11</b>
<b>111<</b>
<b>11<11</b>

<b>
<<<
</b>

With:

<b>& </b>
<b> & </b>
<b> &</b>
<b>&</b>
<b>&111</b>
<b>11&11</b>
<b>111&</b>
<b>11&11</b>

<b>
&
</b>

I have searched the interwebs and tried many of my own solutions. Please, is this possible? And if so, how?

My best guess was something like:

re.sub(r'(?<=>)(.*?)<(.*?)(?=</)', r'\1&lt;\2', string)

But that falls apart with re.DOTALL and '<<<'+ etc.
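To make the failure concrete, here is a small sketch that runs the attempt above on a few of the sample inputs. The helper name `attempt` is only an illustration and is not part of the original post.

```python
import re

# The pattern from the question, unchanged.
ATTEMPT = r'(?<=>)(.*?)<(.*?)(?=</)'

def attempt(text, flags=0):
    """Apply the question's substitution (hypothetical helper name)."""
    return re.sub(ATTEMPT, r'\1&lt;\2', text, flags=flags)

# A single '<' inside the element is handled.
print(attempt('<b>11<11</b>'))

# A run of '<' is not: only the first one is replaced, because the
# second lazy group swallows the remaining '<' characters.
print(attempt('<b><<<</b>'))

# Without re.DOTALL, '.' cannot cross the newline, so nothing matches.
print(attempt('<b>\n<<<\n</b>'))

# With re.DOTALL the pattern matches, but again replaces only one '<'.
print(attempt('<b>\n<<<\n</b>', flags=re.DOTALL))
```

The lazy `(.*?)` groups accept `<` characters themselves, which is why a run such as `<<<` is only partly replaced.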

"doing this as a exercise to help my learn REGEX". Trying to learn regex by processing HTML won't help you learn regular expressions at all. Find another kind of data, almost any other kind of data to learn regex on. Avoid HTML, SGML and XML, however, because the recursion makes learning regex difficult. — S.Lott, Jul 11 '11 at 21:58
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Brandon Boone, Jul 11 '11 at 21:58
tl;dr regex was never meant to be used to parse HTML, so you're doing it wrong either way. — BoltClock, Jul 11 '11 at 22:00
If you're learning regular expressions, you should learn to use them on regular languages. They're not always the swiss army knife people make them out to be. So, [not parsing HTML with regexps](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) still applies. Also, possible duplicate of [Replace all < and > that are NOT part of an HTML tag](http://stackoverflow.com/questions/5463647/replace-all-and-that-are-not-part-of-an-html-tag). — You, Jul 11 '11 at 22:02
Those posts were largely unhelpful and like these comments just complain about using HTML with REGEX :/ — Peach Passion, Jul 12 '11 at 00:57
I wonder why one has to ask and answer the parsing-html-regexes-question each day. A clear -1 — , Jul 12 '11 at 03:21

score 1 · Answer 1 · answered Jul 11 '11 at 22:20

I sincerely hope this is never used on actual HTML, but here is a solution that works for your example data. Note that it replaces with `&lt;` like your sample code, not `&` like in your sample data.

re.sub(r'<+([^<>]*?)(?=</)', r'&lt;\1', your_string)
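As a quick check, the pattern above can be run over the sample inputs from the question. This is only a sketch: the function name `escape_lt` and the `samples` list are illustrative, and the replacement is `&lt;` as in this answer.

```python
import re

# Pattern from this answer: a run of '<', then tag-free text,
# which must be followed by the start of a closing tag.
PATTERN = re.compile(r'<+([^<>]*?)(?=</)')

def escape_lt(text):
    """Replace stray '<' runs inside an element with '&lt;' (illustrative name)."""
    return PATTERN.sub(r'&lt;\1', text)

samples = [
    '<b>< </b>',
    '<b> <</b>',
    '<b><</b>',
    '<b>11<11</b>',
    '<b>\n<<<\n</b>',
]

for s in samples:
    print(repr(s), '->', repr(escape_lt(s)))
```

Because `[^<>]` also matches newlines, the multi-line sample works without `re.DOTALL`.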

Andrew Clark


score 0 · Answer 2 · answered Jul 11 '11 at 22:09

if a is your string, this seems to work:

re.sub('<+([^b/])','&\\1',a)

and a second version, more generic...

re.sub('(<[^<>]+>)([^<>]*)<+([^<>]*)(<[^<>]+>)','\\1\\2&\\3\\4',a)
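The difference between the two versions shows up as soon as the tag is not `b`. The following sketch compares them. The names `specific` and `generic` are illustrative, and the `<i>` input is an assumed example that does not appear in the question.

```python
import re

# First version: skips '<' only when followed by 'b' or '/'.
specific = re.compile(r'<+([^b/])')

# Second version: requires a full tag on each side of the stray '<' run.
generic = re.compile(r'(<[^<>]+>)([^<>]*)<+([^<>]*)(<[^<>]+>)')

def fix_specific(text):
    return specific.sub(r'&\1', text)

def fix_generic(text):
    return generic.sub(r'\1\2&\3\4', text)

# Both handle the <b> sample data.
print(fix_specific('<b>11<11</b>'))
print(fix_generic('<b>11<11</b>'))

# With another tag name, the first version also rewrites the opening tag,
# while the second leaves the tags alone.
print(fix_specific('<i>1<2</i>'))
print(fix_generic('<i>1<2</i>'))
```

The second version consumes the surrounding tags, so it only handles one run of `<` per element.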

edited Jul 11 '11 at 22:21

answered Jul 11 '11 at 22:09

fransua


score 0 · Accepted Answer · answered Jul 11 '11 at 22:33

You could use something like this:

re.sub(r'(?:<(?!/?b>))+', '&', string)

And if you'd want it to work with (some) other tags, you could use something like this:

re.sub(r'(?:<(?!/?\w+[^<>]*>))+', '&', string)
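A short sketch of how the more general pattern behaves. The function name `escape_stray_lt` and the inputs with an `<a>` tag and with free text are assumptions for illustration, not taken from the question.

```python
import re

# A '<' is replaced only when it is NOT followed by something that
# looks like a tag: optional '/', a word, other non-bracket text, '>'.
TAG_AWARE = re.compile(r'(?:<(?!/?\w+[^<>]*>))+')

def escape_stray_lt(text):
    """Collapse each run of non-tag '<' into a single '&' (illustrative name)."""
    return TAG_AWARE.sub('&', text)

# Sample data from the question.
print(escape_stray_lt('<b>11<11</b>'))
print(escape_stray_lt('<b>\n<<<\n</b>'))

# Tags with attributes are recognised by the lookahead as well.
print(escape_stray_lt('<a href="x">1<2</a>'))

# Caveat: anything shaped like a tag is left alone, even in plain text.
print(escape_stray_lt('a <b and c> d'))
```

The last call shows the limit of the "(some) other tags" approach: the lookahead cannot tell a real tag from text that merely looks like one.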

Qtax


Anyway to fix it so = and not ? – Peach Passion Jul 12 '11 at 02:06
In the first example you could just add a `?` after the `b`, something like `(?:<(?!/?b?>))+`. – Qtax Jul 12 '11 at 09:28

score 0 · Answer 4 · answered Jul 11 '11 at 23:39

This tested regex works for your given test data:

reobj = re.compile(r"""
    # Match left angle brackets not part of HTML tag.
    <+               # One or more < but only if
    (?=[^<>]*</\w+)  # inside HTML element contents.
    """, re.VERBOSE)
result = reobj.sub("&", subject)
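For completeness, a self-contained version of the snippet above with the import and some of the question's sample data filled in. The `cases` list and the function name `fix` are illustrative additions.

```python
import re

reobj = re.compile(r"""
    # Match left angle brackets not part of HTML tag.
    <+               # One or more < but only if
    (?=[^<>]*</\w+)  # inside HTML element contents.
    """, re.VERBOSE)

def fix(subject):
    """Replace each run of stray '<' with '&' (illustrative wrapper)."""
    return reobj.sub("&", subject)

# (input, expected output) pairs taken from the question.
cases = [
    ('<b>< </b>', '<b>& </b>'),
    ('<b> <</b>', '<b> &</b>'),
    ('<b><111</b>', '<b>&111</b>'),
    ('<b>\n<<<\n</b>', '<b>\n&\n</b>'),
]

for subject, expected in cases:
    result = fix(subject)
    print(repr(subject), '->', repr(result), 'ok' if result == expected else 'MISMATCH')
```

The lookahead only checks that a closing tag follows with no other bracket in between, so this is a fit for the test data and not a general HTML solution.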

ridgerunner


Hey thanks for the -1! You guys are so smart! I wish I wuz as smart as you! – ridgerunner Jul 12 '11 at 04:44
