0

I have been at this on and off for a few days, but my RegEx mastery is not great. Yes, I understand that RegEx is not for parsing HTML. I am doing server-side "cleaning" of CKEditor input; CKEditor already does this, but only on the client side.

After stripping non-whitelisted tags...

First: $html = preg_replace(' on\w+=(["\'])[^\1]*?\1', '', $html); remove all event attributes properly quoted with either ' or " quotes

Second: $html = preg_replace(' on\w+=\S+', '', $html); remove the ones that have no quotes but can still fire, e.g. onclick=blowUpTheBase()
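For reference, the two passes above could be sketched roughly as follows. This is only an illustration, not a safe sanitizer: the function name `strip_event_attrs_naive` is made up, delimiters and the `i`/`s` flags are added, `[^\1]` is swapped for `(?:(?!\1).)*` (a backreference has no special meaning inside a character class), and the unquoted pass uses `[^\s>]+` instead of `\S+` so it stops at the closing `>`. It still matches text outside of tags.

```php
<?php
// Hypothetical helper: a sketch of the two regex passes, with delimiters added.
function strip_event_attrs_naive(string $html): string
{
    // Pass 1: quoted handlers. (?:(?!\1).)* consumes characters one at a
    // time until the same quote character that opened the value.
    $html = preg_replace('/ on\w+=(["\'])(?:(?!\1).)*\1/is', '', $html);

    // Pass 2: unquoted handlers. [^\s>]+ stops at whitespace or at ">"
    // (an assumption that differs from the original \S+).
    $html = preg_replace('/ on\w+=[^\s>]+/i', '', $html);

    return $html;
}

echo strip_event_attrs_naive('<a href="x" onclick="bad()">t</a>'), "\n";
echo strip_event_attrs_naive('<a href="x" onclick=bad()>t</a>'), "\n";
```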

What I would like to do is ensure the onEvent is between < and >, but I can only get it to work if the onEvent attribute is the first one after a tag. Everything I try ends up capturing most of the code. I just can't get it lazy enough.

ex. $html = preg_replace('<([\s\S]?)( on\w+=\S+) ([\s\S]*?)>', '<$1 $3>', $html);

EDIT: I am going to select @colburton's answer because RegEx is what I asked for. I will also use it for my particular situation because it will do the trick (it is an internal application anyhow).

BUT

I want to thank @Casimir et Hippolyte for his answer because it gives a great example and explanation of how to do this the "right way". I will shortly write up a function using DOMDocument, and it will become my go-to way of handling RTE/WYSIWYG/HTML input.

Chad
  • The `[^\1]` does not work as you think it does. You need to use `(?:(?!\1).)*` instead. Besides, you should use regex delimiters. – Wiktor Stribiżew Jul 14 '17 at 17:19
  • Quote problems and attribute position are two of the many reasons why parsing your HTML with regex is a bad idea. These problems don't exist when you use DOMDocument. Enclose your HTML content inside a fake root element and use this built-in parser. – Casimir et Hippolyte Jul 14 '17 at 17:23
  • Note also that you can't trust external data, so if part of the cleaning is supposed to be done on the client side, you must do it once more on the server side, or at least check it. – Casimir et Hippolyte Jul 14 '17 at 17:28
  • @CasimiretHippolyte DOMDocument, quite the class. I am interested in learning the proper way to parse HTML and clean it, got any links to examples? P.S. I am server side checking it, that's what this question is about. – Chad Jul 14 '17 at 18:37
  • Please do not edit answers into your question. – user3942918 Jul 16 '17 at 01:16
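A minimal sketch of the DOMDocument approach suggested in the comments might look like the following. The function name `remove_event_attributes` and the choice of a `<div>` as the fake root are assumptions made for illustration. It only removes attributes whose names start with "on"; it does not whitelist tags, and it does not set an encoding hint for `loadHTML`.

```php
<?php
// Hypothetical helper: strip on* attributes with the built-in HTML parser.
function remove_event_attributes(string $html): string
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet

    $dom = new DOMDocument();
    // Wrap the fragment in a fake root element (a <div> is an assumption).
    // Assumption: input is ASCII or entity-encoded; without an encoding hint
    // loadHTML may misread raw UTF-8 bytes.
    $dom->loadHTML('<div>' . $html . '</div>');

    // The HTML parser lowercases attribute names, so "onClick" is found too.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*[starts-with(name(), "on")]') as $attr) {
        $attr->ownerElement->removeAttributeNode($attr);
    }

    // Serialize only the children of the fake root, dropping the wrapper.
    $root = $dom->getElementsByTagName('body')->item(0)->firstChild;
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $out;
}

echo remove_event_attributes('<a href="something" onclick="bad()">text</a> onclick not in tags'), "\n";
```

Because the parser decides what is an attribute, quoting style and attribute position stop mattering, and text such as "onclick not in tags" outside an element is left alone.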

1 Answer

5

Maybe I should have mentioned this from the start: this is not how you should try to filter XSS. This is purely academic, within the parameters you proposed (e.g. "use RegEx").


This gets you pretty close:

preg_replace('/(<.+?)(?<=\s)on[a-z]+\s*=\s*(?:([\'"])(?!\2).+?\2|(?:\S+?\(.*?\)(?=[\s>])))(.*?>)/i', "$1 $3", $string);

Tested on

<a href="something" onclick="bad()">text</a> onclick not in tags
<a href="something" onclick=bad()>text</a>
<a href="something" onclick="bad()" >text</a>
<meta name="keywords" content="keyword1, keyword2, keyword3">

<a href="something" onclick= "bad()">text</a> onclick not in tags
<a href="something" onclick =bad()>text</a>
<a href="something" onclick=bad('test')>text</a>
<a href="something" onclick=bad("test")>text</a>
<a href="something" onclick="bad()" >text</a>
What if I write john+onelia=love forever?

Play around here: https://regex101.com/r/GMBaQs/9
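One way to try the pattern from PHP is sketched below. The wrapper name `strip_handlers` and the loop are assumptions added for illustration: each match consumes the tag through its closing `>`, so a tag carrying two handlers needs more than one pass. PHP's preg functions have no `g` modifier (replacement is already global), so only `i` is used here.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function strip_handlers(string $html): string
{
    $pattern = '/(<.+?)(?<=\s)on[a-z]+\s*=\s*(?:([\'"])(?!\2).+?\2'
             . '|(?:\S+?\(.*?\)(?=[\s>])))(.*?>)/i';

    do {
        // $count receives the number of replacements made in this pass.
        $next = preg_replace($pattern, '$1 $3', $html, -1, $count);
        if ($next === null) {
            break; // regex engine error: keep the last good string
        }
        $html = $next;
    } while ($count > 0);

    return $html;
}

echo strip_handlers('<a href="something" onclick="bad()">text</a>'), "\n";
echo strip_handlers('<a onclick="a()" onmouseover="b()">x</a>'), "\n";
```

The replacement leaves extra spaces inside the tag, which is harmless for HTML.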

colburton