
I want to replace every match of a regex with '*', but only when the match lies outside of <>. The whole point is to not interfere with the HTML tags.

I use this to replace:

re.sub(r'SOMEREGEX(?=[^>]*(<|$))', '*', line)

However, I ran into this problem: if my regex is:

f.*k

Then this:

fzzzzzzzzz<HTMLTAG>zzzzzzzk

would become a single '*', because f.*k matches straight across the tag, which I don't want. How do I overcome this problem?

Constraints:

- All brackets are matched.

- There are no nested brackets.

- SOMEREGEX is provided by the user. I would prefer not to change it.

Squall Leohart

2 Answers


You could try replacing the . character - "any character at all" - with the character class [^<>], which matches any character except the angle brackets, <>. This would give the regex f[^<>]*k. This would match facebook but not face<b>book.
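As a quick check of the character-class idea, here is a minimal sketch using the example strings from the question and this answer. The helper name `mask` is made up for illustration, and the pattern is hard-coded, which assumes you are allowed to edit the regex.

```python
import re

# Same pattern as above: "." replaced by "[^<>]" so a match cannot cross a tag.
pattern = re.compile(r'f[^<>]*k')

def mask(line):
    # Hypothetical helper: replace every match with a single '*'.
    return pattern.sub('*', line)

print(mask('facebook'))                      # -> *
print(mask('face<b>book'))                   # -> face<b>book (unchanged)
print(mask('fzzzzzzzzz<HTMLTAG>zzzzzzzk'))   # unchanged
```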

There are still things that can go wrong with this, though. Have you considered using a proper HTML parser instead of regular expressions? BeautifulSoup is easy, tasty and fun.
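If a full parser feels like too much and the user-supplied regex has to stay untouched, one possible middle ground is to split the line on tags and run the regex only on the text pieces. This is a sketch, not a parser: it leans on the constraints stated in the question (brackets are matched and never nested), and the names `TAG` and `mask_outside_tags` are invented for this example.

```python
import re

# Anything from '<' to the next '>' is treated as a tag.
# The capturing group makes re.split keep the tags in the result.
TAG = re.compile(r'(<[^<>]*>)')

def mask_outside_tags(user_regex, line, replacement='*'):
    parts = TAG.split(line)
    # With one capturing group, text sits at even indexes, tags at odd indexes.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(user_regex, replacement, parts[i])
    return ''.join(parts)

print(mask_outside_tags(r'f.*k', 'fzzzzzzzzz<HTMLTAG>zzzzzzzk'))   # unchanged
print(mask_outside_tags(r'f.*k', '<b>fork</b> and <i>flask</i>'))  # -> <b>*</b> and <i>*</i>
```

Because each text piece is handled separately, a match can never span a tag, and the tags themselves are never passed to the regex.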

Hugh Bothwell
Li-aung Yip
  • The thing is, I don't have any control over the given regex. I just want to protect all my tags, and they can be anywhere in the middle of the sentence – Squall Leohart Jun 15 '12 at 23:27
  • Can you please clarify your constraints? (Edit your question to include any additional information that's relevant.) – Li-aung Yip Jun 15 '12 at 23:30

Match only the text between a closing angle bracket (or the start of the line) and the next opening angle bracket (or the end of the line):

re.sub(r'(^|>)f[^<]*k(<|$)', r'\1*\2', line)

The \1 and \2 backreferences put back the angle brackets that the pattern may have consumed from line.
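To see how the substitution behaves, here is a small sketch that wraps the call above in a function. The name `mask_between_tags` and the sample inputs other than the question's own string are made up for illustration.

```python
import re

def mask_between_tags(line):
    # Group 1 is '>' or the start of the line, group 2 is '<' or the end.
    # Both are written back so the surrounding brackets survive.
    return re.sub(r'(^|>)f[^<]*k(<|$)', r'\1*\2', line)

print(mask_between_tags('<b>fork</b>'))                   # -> <b>*</b>
print(mask_between_tags('fork'))                          # -> *
print(mask_between_tags('fzzzzzzzzz<HTMLTAG>zzzzzzzk'))   # unchanged
print(mask_between_tags('<b>a fork</b>'))                 # unchanged
```

As the last call shows, the pattern as written only fires when the f...k run fills the whole stretch of text between two tags, so a match in the middle of a sentence is left alone.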

gvl