0

Possible Duplicate:
Replace all < and > that are NOT part of an HTML tag

  1. Using Python
  2. I know how much everyone here hates REGEX questions about HTML tags, but I am just doing this as an exercise to help me learn REGEX.

Replace (1 can be any character):

<b>< </b>
<b> < </b>
<b> <</b>
<b><</b>
<b><111</b>
<b>11<11</b>
<b>111<</b>
<b>11<11</b>

<b>
<<<
</b>

With:

<b>& </b>
<b> & </b>
<b> &</b>
<b>&</b>
<b>&111</b>
<b>11&11</b>
<b>111&</b>
<b>11&11</b>

<b>
&
</b>

I have searched the web and tried many of my own solutions. Is this possible, and if so, how?

My best guess was something like:

re.sub(r'(?<=>)(.*?)<(.*?)(?=</)', r'\1&lt;\2', string)

But that falls apart with re.DOTALL, with runs such as '<<<', and so on.

Peach Passion
  • 7
    "doing this as a exercise to help my learn REGEX". Trying to learn regex by processing HTML won't help you learn regular expressions at all. Find another kind of data, almost any other kind of data to learn regex on. Avoid HTML, SGML and XML, however, because the recursion makes learning regex difficult. – S.Lott Jul 11 '11 at 21:58
  • 3
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Brandon Boone Jul 11 '11 at 21:58
  • 3
    tl;dr regex was never meant to be used to parse HTML, so you're doing it wrong either way. – BoltClock Jul 11 '11 at 22:00
  • 2
    If you're learning regular expressions, you should learn to use them on regular languages. They're not always the swiss army knife people make them out to be. So, [not parsing HTML with regexps](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) still applies. Also, possible duplicate of [Replace all < and > that are NOT part of an HTML tag](http://stackoverflow.com/questions/5463647/replace-all-and-that-are-not-part-of-an-html-tag). – You Jul 11 '11 at 22:02
  • Those posts were largely unhelpful and, like these comments, just complain about using REGEX with HTML :/ – Peach Passion Jul 12 '11 at 00:57
  • I wonder why one has to ask and answer the parsing-html-regexes-question each day. A clear -1 –  Jul 12 '11 at 03:21

4 Answers

1

I sincerely hope this is never used on actual HTML, but here is a solution that works for your example data. Note that it replaces with &lt; like your sample code, not & like in your sample data.

re.sub(r'<+([^<>]*?)(?=</)', r'&lt;\1', your_string)
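A minimal sketch of how the one-liner above could be checked against the question's sample input. The names SAMPLE and escape_stray_lt are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Sample input copied from the question.
SAMPLE = """<b>< </b>
<b> < </b>
<b> <</b>
<b><</b>
<b><111</b>
<b>11<11</b>
<b>111<</b>
<b>11<11</b>

<b>
<<<
</b>"""


def escape_stray_lt(text):
    """Replace each run of '<' that is followed, with no other angle
    bracket in between, by a closing tag ('</') with a single '&lt;'."""
    # <+          one or more literal '<'
    # ([^<>]*?)   the text up to the closing tag, kept via \1
    # (?=</)      lookahead: a closing tag must start right here
    return re.sub(r'<+([^<>]*?)(?=</)', r'&lt;\1', text)


if __name__ == "__main__":
    print(escape_stray_lt(SAMPLE))
```

Because [^<>] also matches newlines, the multi-line <b> block is handled without re.DOTALL.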
Andrew Clark
0

If a is your string, this seems to work:

re.sub('<+([^b/])','&\\1',a)

and a second version, more generic...

re.sub('(<[^<>]+>)([^<>]*)<+([^<>]*)(<[^<>]+>)','\\1\\2&\\3\\4',a)
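As a rough illustration of the second, more generic pattern, here is a small sketch. The function name replace_between_tags is hypothetical; the pattern is the one from the answer. It captures the opening tag, the text on either side of the stray '<' run, and the following tag, then writes them back with '&' in the middle.

```python
import re

# Pattern from the answer: tag, text, run of '<', text, tag.
BETWEEN_TAGS = re.compile(r'(<[^<>]+>)([^<>]*)<+([^<>]*)(<[^<>]+>)')


def replace_between_tags(text):
    """Replace one run of '<' found between two tags with '&'."""
    return BETWEEN_TAGS.sub(r'\1\2&\3\4', text)


if __name__ == "__main__":
    for line in ["<b>< </b>", "<b><</b>", "<b>11<11</b>", "<b>\n<<<\n</b>"]:
        print(repr(replace_between_tags(line)))
```

One thing to keep in mind: since both surrounding tags are part of the match, this form only handles a single run of '<' per element. An input such as <b>1<2<3</b> comes back unchanged.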
fransua
0

You could use something like this:

re.sub(r'(?:<(?!/?b>))+', '&', string)

If you want it to work with (some) other tags, you could use something like this:

re.sub(r'(?:<(?!/?\w+[^<>]*>))+', '&', string)
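To make the difference between the two patterns concrete, here is a small sketch. The function names are invented for this example; the patterns are the ones given above. Both rely on a negative lookahead: a '<' is replaced only when what follows does not look like the rest of a tag.

```python
import re


def replace_outside_b_tags(text):
    """Replace runs of '<' unless they start <b> or </b>."""
    return re.sub(r'(?:<(?!/?b>))+', '&', text)


def replace_outside_any_tag(text):
    """Replace runs of '<' unless they start something tag-shaped,
    i.e. an optional '/', a word, optional other characters, then '>'."""
    return re.sub(r'(?:<(?!/?\w+[^<>]*>))+', '&', text)


if __name__ == "__main__":
    print(replace_outside_b_tags("<b>11<11</b>"))   # <b> tags are preserved
    print(replace_outside_b_tags("<i>x</i>"))       # <i> tags are not
    print(replace_outside_any_tag("<i>11<11</i>"))  # generic version keeps <i>
```

The first function treats every tag other than b as stray brackets, which is why the second pattern may be the safer choice once other tags appear.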
Qtax
0

This tested regex works for your given test data:

reobj = re.compile(r"""
    # Match left angle brackets not part of HTML tag.
    <+               # One or more < but only if
    (?=[^<>]*</\w+)  # inside HTML element contents.
    """, re.VERBOSE)
result = reobj.sub("&", subject)
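The snippet above leaves subject undefined, so here is a self-contained sketch that runs the same verbose pattern over the question's sample input. SAMPLE and replace_stray_lt are names chosen for this illustration.

```python
import re

# Sample input copied from the question.
SAMPLE = """<b>< </b>
<b> < </b>
<b> <</b>
<b><</b>
<b><111</b>
<b>11<11</b>
<b>111<</b>
<b>11<11</b>

<b>
<<<
</b>"""

STRAY_LT = re.compile(r"""
    # Match left angle brackets not part of HTML tag.
    <+               # One or more < but only if
    (?=[^<>]*</\w+)  # inside HTML element contents.
    """, re.VERBOSE)


def replace_stray_lt(text):
    """Replace each run of '<' inside element contents with '&'."""
    return STRAY_LT.sub("&", text)


if __name__ == "__main__":
    print(replace_stray_lt(SAMPLE))
```

The lookahead only checks that a closing tag follows before any other angle bracket, so the '<' that opens a real tag is left alone: it is always followed by a '>' first.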
ridgerunner