
I've seen this question asked a few times on Stack Overflow, but never with a convincing answer. The answer always seems to be "don't use regex," without any examples of a better alternative.

For my purposes, this is not for validation but for stripping the tags out after the fact.

I need to strip out all script tags including any content that may be between them.

Any suggestions on the best REGEX way to do this?

EDIT: PREEMPTIVE RESPONSE: I can't use HTML Purifier nor the DOMXPath feature of PHP.

kylex
  • Maybe related http://stackoverflow.com/questions/2505957/using-regex-to-remove-script-tags – Pierre-Olivier Mar 27 '12 at 20:34
  • Consider reading this very popular thread http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Tchoupi Mar 27 '12 at 20:57

1 Answer

The reason regex for HTML is considered evil is that a pattern can (usually) be broken easily, forcing you to rethink it again and again. If, for instance, you're matching

<script>.+</script>

It could be broken easily with

<script type="text/javascript">

If you use

<script.+/script>

It can also be easily broken with

< script>...
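To make the two failures concrete, here is a minimal sketch that runs both patterns against the inputs mentioned above. The `~` delimiters, the variable names, and the `alert(1)` bodies are assumptions added for illustration; they are not part of the original patterns.

```php
<?php
// The two patterns quoted above, wrapped in ~ delimiters for preg_match.
$patterns = [
    'first'  => '~<script>.+</script>~',
    'second' => '~<script.+/script>~',
];

// Sample inputs; the script bodies (alert(1)) are made up for illustration.
$samples = [
    'plain'      => '<script>alert(1)</script>',
    'attributes' => '<script type="text/javascript">alert(1)</script>',
    'spaced'     => '< script>alert(1)</script>',
];

$results = [];
foreach ($patterns as $patternName => $pattern) {
    foreach ($samples as $sampleName => $html) {
        // preg_match returns 1 on a match and 0 when nothing matches.
        $results[$patternName][$sampleName] = preg_match($pattern, $html) === 1;
    }
}

foreach ($results as $patternName => $row) {
    foreach ($row as $sampleName => $matched) {
        printf("%-6s | %-10s | %s\n", $patternName, $sampleName, $matched ? 'match' : 'missed');
    }
}
```

The first pattern only catches the plain tag, and the second catches the plain and attribute forms but still misses the spaced one.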

There's no end to this. If you can't use any of the methods you've listed, you could try strip_tags, but it takes a whitelist as a parameter, not a blacklist, so you'll need to explicitly list every single tag you want to allow.
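A short sketch of the strip_tags route, assuming for illustration that only `<p>` and `<b>` should survive (the sample markup and the allowed list are made up). Note one limitation that matters for this question: strip_tags removes the tags themselves but leaves the text between them in place.

```php
<?php
// Sample markup, invented for this sketch.
$html = '<p>Hello <b>world</b></p><script>alert(1)</script>';

// The second argument lists the tags to keep; every other tag is stripped.
$clean = strip_tags($html, '<p><b>');

// The <script> tags are gone, but their text content is still in the output.
echo $clean, "\n"; // <p>Hello <b>world</b></p>alert(1)
```

So on its own this does not satisfy the "including any content between them" requirement; the script body ends up as visible text.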

If all else fails, you could resort to regex. What I came up with is this:

<\s*script.*/script>

But I bet someone around here could probably come and break that too.
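As a rough sketch of how that pattern might be applied with preg_replace, and of two ways it already breaks: as written it is greedy, case-sensitive, and its `.` does not cross newlines. The second pattern below is a hypothetical adjustment (lazy quantifier plus the `i` and `s` modifiers), not something from the answer, and the helper name `stripScripts` and the sample inputs are invented for illustration.

```php
<?php
// The pattern from the answer, wrapped in ~ delimiters so "/" needs no escaping.
$asWritten = '~<\s*script.*/script>~';

// Hypothetical variant: lazy .*?, case-insensitive (i), dot matches newlines (s).
$adjusted = '~<\s*script.*?/script>~is';

// Illustrative helper: remove every match of $pattern from $html.
function stripScripts(string $html, string $pattern): string
{
    // preg_replace returns null on a PCRE error; fall back to the input then.
    return preg_replace($pattern, '', $html) ?? $html;
}

// Two script blocks on one line with real content between them.
$twoBlocks = '<p>one</p><script>a()</script><p>two</p><script>b()</script><p>three</p>';

// An uppercase, multi-line script block.
$multiLine = "<SCRIPT>\nb()\n</SCRIPT>ok";

echo stripScripts($twoBlocks, $asWritten), "\n"; // <p>one</p><p>three</p>
echo stripScripts($twoBlocks, $adjusted), "\n";  // <p>one</p><p>two</p><p>three</p>
echo stripScripts($multiLine, $adjusted), "\n";  // ok
```

With the pattern as written, the greedy `.*` swallows `<p>two</p>` along with both scripts, and the uppercase multi-line block is left untouched. The adjusted variant handles these inputs, but it is still pattern matching rather than parsing, so other inputs can be expected to defeat it.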

Madara's Ghost
  • Thanks! Like I said above, it's not about validation, but removal of code that already exists. – kylex Mar 28 '12 at 17:44