Javascript - Regex/replace optimization

Question

I have a script which allows to replace undesired HTML tags and escape quotes to "improve" security and prevent mainly script tag and onload injection, etc.... This script is used to "texturize" content retrieved from innerHTML.

However, it multiples near by 3 my execution time (in a loop). I would like to know if there is a better way or better regex to do it:

function safe_content( text ) {

    text = text.replace( /<script[^>]*>.*?<\/script>/gi, '' );
    text = text.replace( /(<p[^>]*>|<\/p>)/g, '' );
    text = text.replace( /'/g, '&#8217;' ).replace( /&#039;/g, '&#8217;' ).replace( /[\u2019]/g, '&#8217;' );
    text = text.replace( /"/g, '&#8221;' ).replace( /&#034;/g, '&#8221;' ).replace( /&quot;/g, '&#8221;' ).replace( /[\u201D]/g, '&#8221;' );
    text = text.replace( /([\w]+)=&#[\d]+;(.+?)&#[\d]+;/g, '$1="$2"' );
    return text.trim();

};

EDIT: here a fiddle: https://fiddle.jshell.net/srnoe3s4/1/. Fiddle don't like script tags in javascript string apparently so I didn't add it.

Thank you for your reply. But my script is writted in plain JS (Vanilla). I don't use jQuery and I would like to keep HTML tags but secure them. So, I can't get content from `textContent`. I must get `innerHTML` and after make sure this content is "safe"... Do you have an alternative to regex? — freaky, Apr 21 '17 at 09:38
*Wrong. Client side javascript sanitization can't help avoiding hack attempts.* — revo, Apr 21 '17 at 09:40
If your content is already in innerHtml, how do you make sure it hasn't already been run (especially script tags)? Just interested in the security side of things. — vinntec, Apr 21 '17 at 09:40
This content can be contained in an attribute or in an object. So, it's not necessary parse by HTML or render... — freaky, Apr 21 '17 at 09:42
If you are interested in fewer `replace()` calls, you may join those with the same replacement patterns using alternation - `text = text.replace( /'|'|\u2019/g, '’' );` and `text = text.replace( /"|&(?:#034|quot);|\u201D/g, '”' );` — Wiktor Stribiżew, Apr 21 '17 at 09:45
@musefan Using either jQuery or its equivalent Vanilla JS code (as you said) will evaluate the code, and the whole point of this will be for nothing. OP wants to remove those tags before parsing and evaluating begins. — ibrahim mahrir, Apr 21 '17 at 09:46
I have added a fiddle to compare performance with regex and without... I will try to add the replacement patterns. — freaky, Apr 21 '17 at 09:51
@musefan I just tested it, I think you're right. Evaluatin happens when they get appended to the DOM/ — ibrahim mahrir, Apr 21 '17 at 09:51
Anyway... scrap my suggestion, it seems reducing the regex calls would be a better result — musefan, Apr 21 '17 at 09:58
And do you have an example? I have made a fiddle with the suggestion of replacement patterns and it improve by 2 the execution time: https://fiddle.jshell.net/srnoe3s4/1/ — freaky, Apr 21 '17 at 10:01
@freak: Removing the groups `()` seems to help a bit if you can get away with that — musefan, Apr 21 '17 at 10:06
Just one of the problems with your regex approach. Try running this string through it: `pt>some malicious code here.` — , Apr 21 '17 at 10:10
There are a lot of questions on SO and search results on Google about this topic. Just e.g. http://stackoverflow.com/questions/295566/sanitize-rewrite-html-on-the-client-side Don't do such things on your own, if it is not the main purpose of your product. — ssc-hrep3, Apr 21 '17 at 10:40

score -1 · Accepted Answer · answered Apr 21 '17 at 12:17

I'll just deal with performance and naive security checks since writing a sanitizer is not something you can do on the corner of a table. If you want to save time, avoid calling multiple times replace() if you replace with the same value, which leads you to this:

function safe_content( text ) {
    text = text.replace( /<script[^>]*>.*?<\/script>|(<\/?p[^>]*>)/gi, '' );
    text = text.replace( /'|&#039;|[\u2019]/g, '&#8217;');
    text = text.replace( /"|&#034;|&quot;|[\u201D]/g, '&#8221;' )
    text = text.replace( /([\w]+)=&#[\d]+;(.+?)&#[\d]+;/g, '$1="$2"' );
    return text.trim();
};

If you take into account dan1111's comment about weird string input that will break this implementation, you can add while(/foo/.test(input)) to avoid the issue:

function safe_content( text ) {
    while(/<script[^>]*>.*?<\/script>|(<\/?p[^>]*>)/gi.test(text))
        text = text.replace( /<script[^>]*>.*?<\/script>|(<\/?p[^>]*>)/gi, '' );
    while(/'|&#039;|[\u2019]/g.test(text))
        text = text.replace( /'|&#039;|[\u2019]/g, '&#8217;');
    while(/"|&#034;|&quot;|[\u201D]/g.test(text))
        text = text.replace( /"|&#034;|&quot;|[\u201D]/g, '&#8221;' )
    while(/([\w]+)=&#[\d]+;(.+?)&#[\d]+;/g.test(text))
        text = text.replace( /([\w]+)=&#[\d]+;(.+?)&#[\d]+;/g, '$1="$2"' );
    return text.trim();
};

in standard tests cases, this will not be a lot slower than the previous code. But if the input enter in the scope of dan1111's comment, it might be slower. See perf demo

Javascript - Regex/replace optimization

1 Answers1