I am currently trying to make the output of a wysihtml5 editor secure.
Basically, users must not be able to enter script, form, and similar tags.

I cannot remove all tags, since some of them are needed to display the content as intended
(e.g. <h1> to display a title).

The problem is that users can still add DOM event handlers bound to unwanted code
(e.g. <h1 onclick="alert('Houston, got a problem');"></h1>).

I would like to remove all event handlers inside a div (for all descendants of that div).
The approach I have tried so far is to treat the code as a string and find and replace unwanted content, which worked for the unwanted tags.

What I need now is a regex matching all event handler attributes inside all tags.
Something like "select all [on*] between < and >".
Examples:
<h1 onclick=""></h1> => Should match
<h1 onnewevent=""></h1> => Should match
<h1>onclick=""</h1> => Should NOT match

Thanks in advance for your help ;)

qdrien
  • `yo dawg, uversion 4.0` - You'd need to write something to parse the HTML as a string if you don't want to create the DOM elements. RegEx alone is not the right tool for this. – tenub Mar 18 '14 at 16:41
  • Recommended reading: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – akirilov Mar 18 '14 at 19:53
  • Thanks for the link, finally used a parser – qdrien Mar 19 '14 at 10:23
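
A minimal sketch of the parser-based route the comments above point to, assuming the editor output has already been parsed into DOM elements (in a browser, for instance with `DOMParser`). The function name `stripEventHandlers` is hypothetical, and removing `on*` attributes alone does not make HTML safe (`javascript:` URLs, for example, are not handled here).

```javascript
// Hypothetical helper: remove every attribute whose name starts with "on"
// from `root` and from all of its descendant elements.
function stripEventHandlers(root) {
  // Copy the attribute list first, because removing attributes mutates it.
  for (const attr of Array.from(root.attributes)) {
    if (/^on/i.test(attr.name)) {
      root.removeAttribute(attr.name);
    }
  }
  // Recurse into child elements (text nodes carry no attributes).
  for (const child of Array.from(root.children)) {
    stripEventHandlers(child);
  }
  return root;
}

// Browser-only usage (assumes DOMParser is available):
// const doc = new DOMParser().parseFromString(html, 'text/html');
// const clean = stripEventHandlers(doc.body).innerHTML;
```

Working on parsed elements means that text such as onclick="" between tags is never touched, since only real attributes are inspected.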

1 Answer

You shouldn't be parsing HTML with a regex.
If you really want to, though, this is a quick and dirty way
(by no means complete).

It just looks for an opening tag that carries an 'onevent' attribute, with its closing tag right after it.
If there can be something else in between, just add a .*? between the opening and closing tags.

 #  <([^<>\s]+)\s[^<>]*on[^<>="]+=[^<>]*></\1\s*>
 # /<([^<>\s]+)\s[^<>]*on[^<>="]+=[^<>]*><\/\1\s*>/

 < 
 ( [^<>\s]+ )                    # (1), 'Tag'
 \s 
 [^<>]* on [^<>="]+ = [^<>]*     # On... = event
 >
 </ \1 \s* >                     # Backref to 'Tag'
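
As a hedged illustration of the "add a .*? between tags" variation, the sketch below ports the pattern to JavaScript. `[\s\S]*?` stands in for `.*?` so that content spanning several lines is allowed. The names `HANDLER_ELEMENT` and `findHandlerElements` are made up for this example, and nested elements with the same tag name will not be matched correctly.

```javascript
// The pattern above with a lazy [\s\S]*? between the opening and closing
// tags, so elements that contain text or other markup are matched too.
const HANDLER_ELEMENT = /<([^<>\s]+)\s[^<>]*on[^<>="]+=[^<>]*>[\s\S]*?<\/\1\s*>/g;

// Hypothetical helper: return every element whose opening tag
// carries an on...= attribute.
function findHandlerElements(html) {
  return html.match(HANDLER_ELEMENT) || [];
}

console.log(findHandlerElements('<h1 onclick="x()">Title</h1> <h1>onclick=""</h1>'));
// prints [ '<h1 onclick="x()">Title</h1>' ]
```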

Perl test case

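use strict;
use warnings;

local $/ = undef;    # slurp mode: read all of DATA in one go

my $str = <DATA>;

# Print every element whose opening tag carries an on...= attribute
while ( $str =~ /<([^<>\s]+)\s[^<>]*on[^<>="]+=[^<>]*><\/\1\s*>/g )
{
    print "'$&'\n";
}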


__DATA__
(eg : <h1 onclick="alert('Houston, got a problem');"></h1>) 

I would like to remove all event listeners inside a div
(for all descendants inside that div).
The solution I actually tried to use is to check the code as
a string to find and replace unwanted content,
which worked for the unwanted tags. 

What I actually need is a regex matching all event
listeners inside all tags.
Something like "select all [on*] between < and >".
Examples :
<h1 onclick=""></h1> => Should match
<h1 onnewevent=""></h1> => Should match
<h1>onclick=""</h1> => Should NOT match 

Output >>

'<h1 onclick="alert('Houston, got a problem');"></h1>'
'<h1 onclick=""></h1>'
'<h1 onnewevent=""></h1>'
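
The test case above matches whole elements. If the goal is to drop only the handlers and keep the elements, a two-step replace could look roughly like this in JavaScript. This is a quick and dirty sketch in the same spirit, not a sanitizer: the names `ON_ATTR` and `removeHandlerAttributes` are hypothetical, and it assumes attribute values contain neither `<` or `>` nor text that itself looks like ` on...=`.

```javascript
// One on...= attribute, preceded by whitespace, with a double-quoted,
// single-quoted, or unquoted value.
const ON_ATTR = /\s+on[^\s<>="']*\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]*)/gi;

// Hypothetical helper: strip on* attributes, but only inside tags.
function removeHandlerAttributes(html) {
  // The outer pattern isolates each tag; text between tags is left untouched.
  return html.replace(/<[^<>]+>/g, (tag) => tag.replace(ON_ATTR, ''));
}

console.log(removeHandlerAttributes('<h1 onclick=""></h1>'));     // <h1></h1>
console.log(removeHandlerAttributes('<h1 onnewevent=""></h1>'));  // <h1></h1>
console.log(removeHandlerAttributes('<h1>onclick=""</h1>'));      // <h1>onclick=""</h1>
```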