I'm trying to anonymize an HTML string with a regex, for use in an SQL query.

https://regex101.com/r/QWt1E1/1

(?<!\<)[^<>\s](?!\>)
<p><em>Hi [User</em></p>
<p><em>Tack f&ouml;r visat intresse.</em></p>
<p><em>Good luck!</em><em>&nbsp;</em></p>
<p><em>Sincerely</em></p>
<p><em>nn nnnnn</nm></p>
<p><em>nnnn nnnnnnnn nnnnn nnnnnnnnn</nm></p>
<p><em>nnnn nnnnn</nm><em>nnnnnn</nm></p>
<p><em>nnnnnnnnn</nm></p>

The plan is to replace every character that is not inside <> with an n. It almost works, but in my example it also replaces the e in </em>. I'm not sure why, or how to fix that.

How can I adjust the regex to not replace the e in the example?

Markus Hedlund

1 Answer

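Use a negative lookahead for [^<>]*> instead of just >. This ensures that the current position is not followed by a > before any other angle bracket, which would indicate that you are currently inside a tag. The original pattern only inspects the characters immediately next to the match, so the e in </em> (preceded by / and followed by m) slips through.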

This also means that you can drop the lookbehind:

[^<>\s](?![^<>]*>)
          ^^^^^^

https://regex101.com/r/QWt1E1/3
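
To see the difference outside regex101, here is a minimal Python sketch that runs both patterns over one line of the sample. The function name anonymize_with_regex and the use of Python's re module are assumptions made for illustration; the question itself targets SQL, where the regex dialect may differ.

```python
import re

# Pattern from the question: only looks at the directly adjacent characters.
ORIGINAL = re.compile(r"(?<!\<)[^<>\s](?!\>)")
# Pattern from this answer: skips any character that is followed by a ">"
# before the next angle bracket, i.e. any character inside a tag.
FIXED = re.compile(r"[^<>\s](?![^<>]*>)")


def anonymize_with_regex(html: str, pattern: re.Pattern = FIXED) -> str:
    """Replace every matched character with the letter n (hypothetical helper)."""
    return pattern.sub("n", html)


sample = "<p><em>Good luck!</em><em>&nbsp;</em></p>"
print(anonymize_with_regex(sample, ORIGINAL))  # closing tags come out as </nm>
print(anonymize_with_regex(sample))            # tags are left intact
```

One limit worth keeping in mind: if the text itself contains an unescaped >, the characters before it look like tag content to the lookahead and are left unreplaced.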

Still, it would be better to parse the HTML using an HTML parser, if at all possible.
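
As a rough illustration of the parser route, the sketch below uses Python's standard html.parser module. The class name TextAnonymizer and the helper anonymize_with_parser are hypothetical, and the choice of Python is an assumption, since the question only mentions SQL. It handles tags, text, and entities only; comments and doctype declarations are dropped.

```python
import re
from html.parser import HTMLParser


class TextAnonymizer(HTMLParser):
    """Rebuilds the markup while replacing visible text with n (sketch only)."""

    def __init__(self):
        # Keep entities such as &ouml; as separate events instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared, attributes included.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are re-emitted verbatim as well.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Replace every non-whitespace character of the text with n.
        self.parts.append(re.sub(r"\S", "n", data))

    def handle_entityref(self, name):
        # Assumption: a named entity stands for one character, so emit one n.
        self.parts.append("n")

    def handle_charref(self, name):
        # Same assumption for numeric references such as &#246;.
        self.parts.append("n")


def anonymize_with_parser(html: str) -> str:
    parser = TextAnonymizer()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(anonymize_with_parser("<p><em>Tack f&ouml;r visat intresse.</em></p>"))
```

Unlike the regex, this version treats an entity as a single character, so f&ouml;r becomes nnn rather than eight n characters. Whether that is preferable depends on what the anonymized output is used for.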

CertainPerformance