I've hit a strange situation that I can't figure my way out of. I have a string that contains escaped UTF8 characters. I've tried decodeURIComponent(escape(str)) along with a bunch of other suggested fixes, so far without success.

I've written this function to take the string, find the escaped characters, and replace them with straight UTF8.

var unescapeUTF8 = function(str) {
    var matches = str.match(/\\u.{4}/g);
    if (matches == null) return str;
    for (var item of matches)
    {
        // testing
        console.log(new String(item));
    }
    ....
    ....
    ....
};

From testing, I know that new String("\u0123") gives back a String object: String {0: "ģ", length: 1, [[PrimitiveValue]]: "ģ"}

It seems that no matter what I do to the string in the function above, I cannot get it to convert from its escaped \u0123 to ģ.

I've managed to reproduce the issue in my browser by opening developer tools and running the following:

var x = "\\u0123";
console.log(x); // == "\u0123"
new String(x); // == String {0: "\", 1: "u", 2: "0", 3: "1", 4: "2", 5: "3", length: 6, [[PrimitiveValue]]: "\u0123"}

Can anyone figure out how to convert x into a UTF8 character, please?

TolMera
  • And ``new String((new String("\\u0123")).toString())`` does not work; it seems to be holding onto that leading \ somewhere, somehow. – TolMera Oct 25 '17 at 14:32
  • [How do I decode a string with escaped unicode?](https://stackoverflow.com/questions/7885096/how-do-i-decode-a-string-with-escaped-unicode) ? – Alex K. Oct 25 '17 at 14:34
  • 1
    `new String("\u0123")` is a false trail because the string is already that character `new String("\u0123") == "\u0123"` is `true`. – Alex K. Oct 25 '17 at 14:35
  • `\u....` is not a "UTF-8 encoded character", it's a Unicode escape sequence. You cannot tell whether a string is encoded in UTF-8 or something else just by looking at it; you can however tell that the characters in the string represent some escape format. – deceze Oct 25 '17 at 14:41
  • Re-read the question: ``new String("\\u0123") == 'ģ'`` is ``false``, BUT ``new String("\u0123") == 'ģ'`` is ``true``. BUT again... ``var x = "\\u0123"; new String(x) == '\u0123'`` is ``false``. – TolMera Oct 25 '17 at 14:42

1 Answer

Since those escape sequences are, at first blush, valid JSON escape sequences, the easiest method is to parse the string as a JSON string:

var x = "\\u0123";
console.log(JSON.parse('"' + x + '"'));
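
If the input might contain characters that JSON rejects inside a string literal (an unescaped double quote or a raw newline, for example), a regex replacement may be a safer alternative. This is only a minimal sketch, assuming that the only escapes to decode are `\uXXXX` with exactly four hex digits; the name `unescapeUnicode` is hypothetical and not from the question:

```javascript
// Hypothetical helper: decodes \uXXXX escape sequences found in a string.
// Assumes exactly four hex digits per escape; anything else is left untouched.
function unescapeUnicode(str) {
    return str.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
        // Each \uXXXX escape names a single UTF-16 code unit.
        return String.fromCharCode(parseInt(hex, 16));
    });
}

var x = "\\u0123";
console.log(unescapeUnicode(x)); // ģ
```

Unlike the JSON approach, this sketch does not handle other escapes such as `\n` or an escaped backslash in front of the `u`, so it only fits input where `\uXXXX` is the sole escape format.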
deceze
  • Emphasizing: This answer has nothing to do with UTF-8, which makes sense because it seems that the question has nothing to do with UTF-8 (as observed in a comment on the question). The escapes formatted as "\uABCD" are for UTF-16 code units, which is what JavaScript and JSON do use in strings. – Tom Blodget Oct 25 '17 at 22:36
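
To illustrate the point about UTF-16 code units: a character outside the Basic Multilingual Plane is written as two `\uXXXX` escapes (a surrogate pair), and `JSON.parse` turns them into a string of two code units that together form one code point. A small sketch; U+1F600 is just an arbitrary example character:

```javascript
// U+1F600 is written as two UTF-16 code units: D83D and DE00.
var escaped = "\\uD83D\\uDE00";
var decoded = JSON.parse('"' + escaped + '"');

console.log(decoded.length);                      // 2 (two UTF-16 code units)
console.log(decoded.codePointAt(0).toString(16)); // 1f600 (one code point)
```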