Use JavaScript regex to replace numerical HTML entities with their actual characters

Question

I'm trying to use JavaScript & regex to replace numerical HTML entities with their actual Unicode characters, e.g.

foo&#39;s bar
→
foo's bar

This is what I got so far:

"foo&#39;s bar".replace(/&#([^\s]*);/g, "$1"); // "foo39s bar"

All that's left to do is to replace the number with String.fromCharCode($1), but I can't seem to get it to work. How can I do this?

Ilia G · Accepted Answer · 2016-06-14T17:54:07.043

11

"foo&#39;s bar".replace(/&#(\d+);/g, function(match, match2) {return String.fromCharCode(+match2);})

edited Jun 14 '16 at 17:54

answered Nov 27 '10 at 15:21


That just returns `"foos bar"`. Am I missing something? Edit: Oh, apparently that's because `match` = `"&#39;"` and not just the `39`. – alfonso Nov 27 '10 at 15:23

score 3 · Answer 2 · answered Nov 27 '10 at 15:27


"foo&#39;s bar".replace(/&#([^\s]*);/g, function(x, y) { return String.fromCharCode(y) })

The first argument (x) is the whole match, "&#39;" in this example; y is the captured group, "39".
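
To make the argument order concrete, here is a minimal sketch that records what the replacer callback receives. The name `decodeDecimal` and the optional `log` parameter are made up for this illustration. A plain `"$1"` replacement string cannot do the job, because an expression such as `String.fromCharCode("$1")` would be evaluated once, before `replace` ever runs.

```javascript
// Hypothetical helper, named only for this illustration.
function decodeDecimal(s, log) {
    return s.replace(/&#(\d+);/g, function (match, digits, offset) {
        // match  : the whole matched text, e.g. "&#39;"
        // digits : the first capture group, e.g. "39" (a string)
        // offset : index of the match within the input
        if (log) log.push([match, digits, offset]);
        return String.fromCharCode(Number(digits));
    });
}

var seen = [];
var result = decodeDecimal("foo&#39;s bar", seen);
console.log(result); // foo's bar
console.log(seen);   // one entry: "&#39;", "39", 3
```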


Vladimir Lagunov


score 3 · Answer 3 · answered Nov 27 '10 at 16:01

As well as using a callback function, you may want to consider adding support for hex character references (&#x1234;).

Also, fromCharCode may not be enough. E.g. &#67840; is a valid reference to a Phoenician character, but because it is outside the Basic Multilingual Plane, and JavaScript's String model is based on UTF-16 code units, not complete character code points, fromCharCode(67840) won't work. You'd need a UTF-16 encoder, for example:

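// Build a string from full Unicode code points (not just UTF-16 code units).
String.fromCharCodePoint = function(/* codepoints */) {
    var codeunits = [];
    for (var i = 0; i < arguments.length; i++) {
        var c = arguments[i];
        if (c < 0x10000) {
            // Basic Multilingual Plane: a single code unit.
            codeunits.push(c);
        } else if (c < 0x110000) {
            // Supplementary plane: encode as a high/low surrogate pair.
            c -= 0x10000;
            codeunits.push((c >> 10 & 0x3FF) + 0xD800);
            codeunits.push((c & 0x3FF) + 0xDC00);
        }
        // Values of 0x110000 and above are not code points and are skipped.
    }
    return String.fromCharCode.apply(String, codeunits);
};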

function decodeCharacterReferences(s) {
    // One pass over the input, so text produced by decoding one reference
    // (e.g. "&#38;#x41;" becoming "&#x41;") is never decoded a second time.
    return s.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, function(_, hex, dec) {
        return String.fromCharCodePoint(hex ? parseInt(hex, 16) : parseInt(dec, 10));
    });
}

alert(decodeCharacterReferences('Hello &#x10900; mum &#67840;!'));
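
In engines that implement ECMAScript 2015 or later, the built-in `String.fromCodePoint` does the UTF-16 encoding itself, so the hand-written encoder is only needed for older environments. Below is a minimal sketch under that assumption. `decodeNumericReferences` is an illustrative name, and substituting U+FFFD for out-of-range values is a choice made here, not something this answer specifies.

```javascript
// Hypothetical helper; assumes an ES2015+ engine with String.fromCodePoint.
function decodeNumericReferences(s) {
    return s.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, function (match, hex, dec) {
        var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
        // String.fromCodePoint throws a RangeError above 0x10FFFF,
        // so substitute U+FFFD (the replacement character) instead.
        if (codePoint > 0x10FFFF) return "\uFFFD";
        return String.fromCodePoint(codePoint);
    });
}

var decoded = decodeNumericReferences("Hello &#x10900; mum &#67840;!");
// U+10900 is stored as the surrogate pair D802 DD00.
console.log(decoded === "Hello \uD802\uDD00 mum \uD802\uDD00!"); // true
```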

score 0 · Answer 4 · answered Nov 27 '10 at 15:44


If you don't want to define all the entities yourself, you can let the browser do it for you. This snippet creates an empty p element, sets its innerHTML, and returns the text it produces. The p element is never added to the document.

function translateEntities(string){
    var text, p=document.createElement('p');
    p.innerHTML=string;
    text= p.innerText || p.textContent;
    p.innerHTML='';
    return text;
}
var s= 'foo&#39;s bar';
translateEntities(s);

/*  returned value: (String)
foo's bar
*/


kennebec


Please don't do this. The built-in HTML parser has far too much authority to trust with arbitrary content. This is just waiting for XSS to happen. Even though script elements aren't executed as a result of setting `innerHTML`, that is just one vector. There are many others (CSS `expression`, `onerror` handlers, object and embed elements, embedded XML and external entities) to name a few that might be able to cause code execution or allow arbitrary network requests. – Mike Samuel Nov 27 '10 at 16:15
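
One way to avoid handing untrusted input to the HTML parser at all is to decode references with plain string operations. A rough sketch follows. The `NAMED` table and `decodeEntitiesSafely` are hypothetical names, the table covers only five common named entities rather than the full HTML list, and `String.fromCodePoint` assumes an ES2015+ engine.

```javascript
// Hypothetical, deliberately tiny table; the real HTML list is much longer.
var NAMED = { amp: "&", lt: "<", gt: ">", quot: "\"", apos: "'" };

function decodeEntitiesSafely(s) {
    return s.replace(/&(?:#x([0-9a-f]+)|#(\d+)|([a-z]+));/gi, function (match, hex, dec, name) {
        if (name !== undefined) {
            // Unknown names are left untouched rather than guessed at.
            return Object.prototype.hasOwnProperty.call(NAMED, name) ? NAMED[name] : match;
        }
        var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
        // Leave out-of-range numeric references as they were written.
        return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
    });
}

console.log(decodeEntitiesSafely("&lt;img src=x onerror=alert(1)&gt; foo&#39;s bar"));
// <img src=x onerror=alert(1)> foo's bar
```

The result is an ordinary string that was never parsed as markup, so nothing in it can run. It still has to be escaped again if it is later written into HTML.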
