5

I'm using JavaScript to set the value of an input with text that may contain HTML-specific characters such as &amp; and &nbsp;. I'm trying to find a single regex that will match these entities and replace them with the appropriate values ("&" and " ", respectively), but I can't figure out the regex to do it.

Here's my attempt:

Make an object that contains the matches and reference to the replacement value:

var specialChars = {
  "&nbsp;" : " ",
  "&amp;"  : "&",
  "&gt;"   : ">",
  "&lt;"   : "<"
}

Then, I want to match my string

var stringToMatch = "This string has special chars &amp; and &nbsp;"

I tried something like

stringToMatch.replace(/(&nbsp;|&amp;)/g,specialChars["$1"]);

but it doesn't work. I don't really understand how to capture the special tag and replace it. Any help is greatly appreciated.

wp78de
brad

5 Answers

18

I think you can use the functions from a question on a slightly different subject (Efficiently replace all accented characters in a string?).

Jason Bunting's answer has some nice ideas plus the necessary explanation. Here is his solution with some modifications to get you started (if you find this helpful, upvote his original answer as well, since this is essentially his code).

var replaceHtmlEntites = (function() {
    var translate_re = /&(nbsp|amp|quot|lt|gt);/g,
        translate = {
            'nbsp': String.fromCharCode(160), 
            'amp' : '&', 
            'quot': '"',
            'lt'  : '<', 
            'gt'  : '>'
        },
        translator = function($0, $1) { 
            return translate[$1]; 
        };

    return function(s) {
        return s.replace(translate_re, translator);
    };
})();

callable as

var stringToMatch = "This string has special chars &amp; and &nbsp;";
var stringOutput  = replaceHtmlEntites(stringToMatch);
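
For context on why the attempt in the question fails: `specialChars["$1"]` is evaluated once, before `replace` is even called, so it looks up the literal key "$1" and yields `undefined`. A function replacement is invoked for every match instead. A small sketch of the difference (the names `lookup`, `input`, `broken` and `fixed` are made up for illustration):

```javascript
var lookup = { "&nbsp;": " ", "&amp;": "&" };
var input = "a&nbsp;b &amp; c";

// String form: lookup["$1"] is undefined, and replace() converts that
// to the string "undefined" for every match.
var broken = input.replace(/(&nbsp;|&amp;)/g, lookup["$1"]);
// -> "aundefinedb undefined c"

// Function form: the callback receives each matched entity and looks it up.
var fixed = input.replace(/(&nbsp;|&amp;)/g, function (match) {
    return lookup[match];
});
// -> "a b & c"
```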

Numbered entities are even easier: you can replace them much more generically using a little math and String.fromCharCode().
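
A minimal sketch of the numeric case, assuming both decimal (&#169;) and hexadecimal (&#xA9;) references should be handled. The name `replaceNumericEntities` is hypothetical, and the sketch swaps in String.fromCodePoint() (ES2015) for String.fromCharCode() so that code points above 0xFFFF also work:

```javascript
// Hypothetical helper, not part of the original answer.
function replaceNumericEntities(s) {
    // Matches &#NNN; (decimal) or &#xHHH; (hexadecimal, case-insensitive).
    return s.replace(/&#(?:x([0-9a-f]+)|([0-9]+));/gi, function (match, hex, dec) {
        var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
        // Leave out-of-range references untouched instead of letting
        // String.fromCodePoint() throw a RangeError.
        if (codePoint > 0x10FFFF) {
            return match;
        }
        return String.fromCodePoint(codePoint);
    });
}

replaceNumericEntities("&#169; and &#xA9;");
// -> "© and ©"
```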


Another, much simpler possibility would be like this (works in any browser)

function replaceHtmlEntites(string) {
    var div = document.createElement("div");
    div.innerHTML = string;
    return div.textContent || div.innerText;
}

replaceHtmlEntites("This string has special chars &lt; &amp; &gt;");
// -> "This string has special chars < & >"
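
One caveat with the div version, as far as I know: if the string comes from an untrusted source, assigning it to a div's innerHTML can still trigger markup-embedded handlers (for example an img with an onerror attribute), even though the div is never attached to the page. A commonly used variant is a textarea, whose content is parsed as text rather than markup. This is only a sketch, the name `decodeEntitiesViaTextarea` is made up here, and it needs a browser DOM to run:

```javascript
// Hypothetical variant of the function above; requires a browser DOM.
function decodeEntitiesViaTextarea(string) {
    var textarea = document.createElement("textarea");
    // A textarea's content is parsed as text, so no child elements are created,
    // but character references such as &amp; are still decoded.
    textarea.innerHTML = string;
    return textarea.value;
}
```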
Tomalak
  • So apparently I don't understand regex's. Your code looked pretty good to me, but the value (s) being passed into the function actually contains the whole &nbsp; Not just the nbsp. I thought the brackets were supposed to match just the inside chars? Anyway, modding that translate object to contain the whole "&nbsp;", "&amp;" etc. worked. otherwise it just returned undefined. Thanks – brad Aug 04 '09 at 20:34
  • The answer has been modified a little bit to accommodate for this. I guess you've tried with the original code. The above works for me, I've just tried it out (again). – Tomalak Aug 05 '09 at 06:24
  • Really? No I copied your code and ran it through the debugger, the value for me being passed in (s) was the whole &nbsp;. Very odd. I'm using safari and I tested in Firefox. I'll try a few other browsers too. Anyway thanks again – brad Aug 05 '09 at 13:39
  • Sorry, I just noticed the changes. The extra entity attr in the return function. Thx again!! – brad Aug 05 '09 at 13:50
2

Another way would be creating a div object

var tmp = document.createElement("div");

Then assigning the text to its innerHTML

tmp.innerHTML = mySpecialString;

And finally reading the element's text content

var output = tmp.textContent || tmp.innerText //for IE compatibility

And there you go...

BYK
  • I'm using the text to set a value of an input (w/ jquery) so $(input).val(someText) it's the someText that needs the replacement – brad Aug 04 '09 at 20:47
  • Okay I got the point. When you do what I have suggested, all the values are converted by the HTML engine of the browser since the "textContent" or "innerText" property contains the "resultant text". – BYK Aug 04 '09 at 21:12
1

You can use a function based replacement to do what you want to do:

var myString = '&'+'nbsp;&'+'nbsp;&tab;&copy;';
// replace() returns a new string; the callback receives each matched entity
var result = myString.replace(/&\w+?;/g, function( e ) {
    switch(e) {
        case '&nbsp;': 
            return ' ';
        case '&tab;': 
            return '\t';
        case '&copy;': 
            return String.fromCharCode(169);
        default: 
            return e;
    }
});

However, I do urge you to consider your situation. If you're receiving &nbsp; and &copy; and other HTML entities in your text values, do you really want to replace them? Should you be converting them afterwards?

Just something to keep in mind.

Cheers!

Umber Ferrule
coderjoe
  • This is way more straightforward than the above accepted answer. Also, I believe it will scale better as more entities are added to the list, which is important since the list of named entities is SUPER long. You got robbed son! – Toby Apr 06 '11 at 00:35
  • I'm not in it for the points. I'm in it for the questions. But thanks for the sentiment. :) There's also the fact that the default case should be return e; not wrapping. Fixed above. – coderjoe Apr 07 '11 at 14:56
0

A modern variation that doesn't use painful switch/case statements:

const toEscape = `<code> 'x' & "y" </code> <\code>`

toEscape.replace(
  /[&"'<>]/g,
  (char) => ({
      "&": '&amp;',
      "\"": '&quot;',
      "'": '&#39;',
      "<": '&lt;',
      ">": '&gt;',
    })[char]
)

Or, since this really should be turned into a function:

const encodeHTML = function(str) {
    const charsToEncode = /[&"'<>]/g
    const encodeTo = {
      "&": '&amp;',
      "\"": '&quot;',
      "'": '&#39;',
      "<": '&lt;',
      ">": '&gt;',
    }
    return str.replace(charsToEncode, char => encodeTo[char])
}

(This list of characters is based on the list of XML escape character codes on Wikipedia.)
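
Note that both snippets above encode special characters into entities, which is the opposite direction from what the question asks. A minimal sketch of the matching decoder, assuming only the same five entities need handling (`decodeHTML` and `decodeTo` are names made up here, not from the answer):

```javascript
const decodeHTML = function(str) {
    // Inverse of the encoding table above.
    const decodeTo = {
      '&amp;': '&',
      '&quot;': '"',
      '&#39;': "'",
      '&lt;': '<',
      '&gt;': '>',
    }
    // One pass over the string, so already-decoded text is not decoded again.
    return str.replace(/&(?:amp|quot|#39|lt|gt);/g, entity => decodeTo[entity])
}

decodeHTML('&lt;b&gt; &quot;x&quot; &amp; &#39;y&#39;')
// -> `<b> "x" & 'y'`
```

Doing everything in a single `replace` call matters here: chaining one `replace` per entity could turn `&amp;lt;` into `<`, whereas a single pass leaves it as `&lt;`.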

Kyle Baker
0

A better approach for removing any HTML tags and HTML special characters would be to just strip them with a regex:

str.replace(/<[^>]*>/g, '').replace(/[^\w\s]/gi, '')
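
Be aware that this removes characters rather than converting them, so entity names and ordinary punctuation are affected too. A quick illustration (the `sample` string and variable names are made up):

```javascript
var sample = "<p>Fish &amp; chips, please!</p>";

// First pass drops the tags; second pass drops everything that is not
// a word character or whitespace, including "&", ";", "," and "!".
var stripped = sample.replace(/<[^>]*>/g, '').replace(/[^\w\s]/gi, '');
// -> "Fish amp chips please"
```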
Alberto S.