
For example, I have the following string:

"Hi I am testing a weird character &#366;, its a U with a circle"

My string uses the HTML code &#366; to display the U-circle. However, I need it as a Unicode escape, i.e. \u016E. Is there a good systematic way to do this with plain vanilla JavaScript?

Mark
  • See http://stackoverflow.com/questions/2808368/converting-html-entities-to-unicode-character-in-javascript – Stefano Sanfilippo May 06 '13 at 14:15
  • What is "Unicode format"? Do you mean `U+016E` or its JavaScript equivalent, `\u016E`? Or just the encoding the HTML file uses (i.e. the character itself)? By the way, &#366; is not hexadecimal. – Mr Lister May 06 '13 at 14:18
  • The problem with the answers to the question linked above is that unless you're in a browser, none of them addresses decoding numeric entities. – T.J. Crowder May 06 '13 at 14:21
  • @MrLister Yes exactly, the JavaScript equivalent of \u016E – Mark May 06 '13 at 14:35

1 Answer


If you want to convert numeric HTML character references to Unicode escape sequences, try the following (it doesn't work with code points above 0xFFFF):

function convertCharRefs(string) {
    return string
        // Decimal references such as &#366;
        .replace(/&#(\d+);/g, function(match, num) {
            var hex = parseInt(num, 10).toString(16);
            while (hex.length < 4) hex = '0' + hex;
            return "\\u" + hex;
        })
        // Hexadecimal references such as &#x16E; (only hex digits are matched)
        .replace(/&#[xX]([0-9A-Fa-f]+);/g, function(match, hex) {
            while (hex.length < 4) hex = '0' + hex;
            return "\\u" + hex;
        });
}
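The 0xFFFF limit exists because a `\uXXXX` escape holds a single 16-bit code unit. One possible way around it, sketched below, is to emit a surrogate pair for higher code points. The names `toEscape` and `convertCharRefsAstral` are made up for this illustration, and the sketch assumes an engine with `String.prototype.padStart` (ES2017); it also upper-cases the hex digits, which is a stylistic choice rather than a requirement.

```javascript
// Hypothetical helper: format one 16-bit code unit as a \uXXXX escape.
function toEscape(codeUnit) {
    return "\\u" + codeUnit.toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical variant of convertCharRefs that also covers code points above 0xFFFF.
function convertCharRefsAstral(string) {
    return string.replace(/&#(?:[xX]([0-9A-Fa-f]+)|(\d+));/g, function (match, hex, dec) {
        var codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
        // Leave references outside the Unicode range untouched.
        if (codePoint > 0x10FFFF) return match;
        // Code points in the Basic Multilingual Plane fit in one escape.
        if (codePoint <= 0xFFFF) return toEscape(codePoint);
        // Otherwise split into a high and a low surrogate.
        var offset = codePoint - 0x10000;
        return toEscape(0xD800 + (offset >> 10)) + toEscape(0xDC00 + (offset & 0x3FF));
    });
}

console.log(convertCharRefsAstral("&#366; and &#x1F600;"));
// \u016E and \uD83D\uDE00
```
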

If you simply want to decode the character references:

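function decodeCharRefs(string) {
    return string
        // Decimal references such as &#366;
        .replace(/&#(\d+);/g, function(match, num) {
            return String.fromCodePoint(parseInt(num, 10));
        })
        // Hexadecimal references such as &#x16E; (only hex digits are matched)
        .replace(/&#[xX]([0-9A-Fa-f]+);/g, function(match, num) {
            return String.fromCodePoint(parseInt(num, 16));
        });
}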

Both functions use String.prototype.replace with a function as the replacement; the function receives the full match followed by each captured group.
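As a quick check, here is a minimal sketch that applies the same replace-with-callback idea to the sample string, assuming the question's string contained the decimal reference `&#366;`. The variable names are only for illustration, and only the decimal form is handled to keep it short.

```javascript
// Sample input, assuming the character was written as a decimal reference.
var input = "Hi I am testing a weird character &#366;, its a U with a circle";

// The callback gets the whole match first, then one argument per capture group.
var escaped = input.replace(/&#(\d+);/g, function (match, num) {
    var hex = parseInt(num, 10).toString(16).toUpperCase();
    while (hex.length < 4) hex = "0" + hex;
    return "\\u" + hex;
});

// Same pattern, but the callback returns the decoded character instead.
var decoded = input.replace(/&#(\d+);/g, function (match, num) {
    return String.fromCodePoint(parseInt(num, 10));
});

console.log(escaped); // Hi I am testing a weird character \u016E, its a U with a circle
console.log(decoded); // Hi I am testing a weird character Ů, its a U with a circle
```
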

nwellnhof