2

Does anyone know of a good regular expression to remove event handlers from HTML?

For example, the string:
<h1 onmouseover="top.location='http://www.google.com'">Large Text</h1> becomes <h1>Large Text</h1>
So the HTML tags are preserved, but event attributes like onmouseover, onmouseout, onclick, etc. are removed.

Thanks in Advance!

James Cal
    -1 (X)HTML is not a regular language. If you're doing this as some sort of "sanitization", it's especially unsafe - there may be some edge cases which are parsed as JavaScript by certain tag soup parsers; an obvious candidate is IE's conditional comments. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – tc. Oct 02 '10 at 02:30

2 Answers

5

How about:

data.replace(/ on\w+="[^"]*"/g, '');
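As a quick check, here is that one-liner applied to the example from the question. This is only a sketch: the variable names are made up for illustration, and the pattern only handles double-quoted attribute values.

```javascript
// Hypothetical input, taken from the example in the question.
const data = '<h1 onmouseover="top.location=\'http://www.google.com\'">Large Text</h1>';

// Remove every ` onxxx="..."` sequence (double-quoted values only).
const stripped = data.replace(/ on\w+="[^"]*"/g, '');

console.log(stripped); // <h1>Large Text</h1>
```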

Edit from the comments:

This is intended to be run on your markup as a one-time thing. If you're trying to remove events dynamically during the execution of the page, that's a slightly different story. A JavaScript library like jQuery makes it extremely easy, though (note that this only removes handlers that were attached through jQuery):

$('*').unbind();

Edit:

Restricting this to only within tags is a lot harder. I'm not confident it can be done with a single regular expression. However, this should get you by if no one can come up with one:

var matched;

do
{
    matched = false;
    data = data.replace(/(<[^>]+)( on\w+="[^"]*")+/g,
        function(match, goodPart)
        { 
            matched = true;
            return goodPart;
        });
} while(matched);
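The loop is needed because the greedy <[^>]+ swallows as much of the tag as it can, so each pass strips only the last handler attribute in a tag. A small sketch that counts the passes (removeHandlersLooping is a hypothetical wrapper name, not part of the code above):

```javascript
// Hypothetical wrapper around the loop above that also counts the passes.
function removeHandlersLooping(data) {
    let passes = 0;
    let matched;
    do {
        matched = false;
        // goodPart is the tag text that precedes the last handler attribute.
        data = data.replace(/(<[^>]+)( on\w+="[^"]*")+/g, function (match, goodPart) {
            matched = true;
            return goodPart;
        });
        passes++;
    } while (matched);
    return { result: data, passes: passes };
}

const out = removeHandlersLooping('<a onclick="a()" onfocus="b()" href="#">link</a>');
console.log(out.result); // <a href="#">link</a>
console.log(out.passes); // 3 (two passes that remove something, one that finds nothing)
```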

Edit:

I surrender at writing a single regex for this. There must be some way to check the context of a match without actually capturing the beginning of the tag in your match, but my RegEx-fu is not strong enough. This is the most elegant solution I'm going to come up with:

data = data.replace(/<[^>]+/g, function(match)
{
    return match.replace(/ on\w+="[^"]*"/g, '');
});
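A slightly more tolerant variant of the same two-step idea, as a sketch only: it assumes you also want to catch upper-case attribute names and single-quoted or unquoted values. The function name stripInlineHandlers is made up for illustration. It can still be fooled by a > inside an attribute value or by attribute text that merely looks like a handler, so it should not be relied on for sanitizing untrusted HTML.

```javascript
// Sketch only: strips inline on* attributes that appear inside tags.
// stripInlineHandlers is a hypothetical name, not from the answer above.
function stripInlineHandlers(html) {
    // Visit each tag opening: "<" up to, but not including, the next ">".
    return html.replace(/<[^>]+/g, function (tag) {
        // Inside the tag, drop whitespace + on<name> = value, where the value
        // is double-quoted, single-quoted, or unquoted. The i flag covers
        // names such as ONCLICK.
        return tag.replace(/\s+on\w+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, '');
    });
}

console.log(stripInlineHandlers('<a ONCLICK=\'x()\' href="#">go</a>')); // <a href="#">go</a>
console.log(stripInlineHandlers('<p>onclick="x" in text</p>'));        // unchanged
```

Text outside of tags is left alone because only the spans that start with < are passed to the inner replace.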
Ian Henry
  • Very good answer. Just a note for James: it won't remove events that have been attached unobtrusively from script, and it won't remove click handlers triggered through href='javascript:function()'. – sushil bharwani Oct 02 '10 at 01:10
  • Thank you for answering, Ian. I am just replacing raw HTML, so the regex looks good. However, is there a way to specify it so that it matches only if the string is inside a tag? Currently the regex would replace "onclick events can be written as onclick="something" " with "onclick events can be written as ". Any ideas? Thanks – James Cal Oct 02 '10 at 01:34
  • I appreciate the effort! I think your final attempt will work perfectly for me. Thank you :) – James Cal Oct 02 '10 at 05:51
  • The regex fails (1) when the unwanted attributes are written in capital letters, and (2) when there is no space before the unwanted event attribute. Solution: data.replace(/ ?on\w+="[^"]*"/gi, '') – The concise Jul 18 '22 at 16:55
0

Here's a pure JS way to do it:

function clean(html) {
    function stripHTML(){
        html = html.slice(0, strip) + html.slice(j);
        j = strip;
        strip = false;
    }
    function isValidTagChar(str) {
        return str.match(/[a-z?\\\/!]/i);
    }
    var strip = false; //keeps track of index to strip from
    var lastQuote = false; //keeps track of whether or not we're inside quotes and what type of quotes
    for(var i=0; i<html.length; i++){
        if(html[i] === "<" && html[i+1] && isValidTagChar(html[i+1])) {
            i++;
            //Enter element
            for(var j=i; j<html.length; j++){
                if(!lastQuote && html[j] === ">"){
                    if(strip) {
                        stripHTML();
                    }
                    i = j;
                    break;
                }
                if(lastQuote === html[j]){
                    lastQuote = false;
                    continue;
                }
                if(!lastQuote && html[j-1] === "=" && (html[j] === "'" || html[j] === '"')){
                    lastQuote = html[j];
                }
                //Find on statements
                if(!lastQuote && html[j-2] === " " && html[j-1] === "o" && html[j] === "n"){
                    strip = j-2;
                }
                if(strip && html[j] === " " && !lastQuote){
                    stripHTML();
                }
            }
        }
    }
    return html;
}
winhowes