67

I'm making a JavaScript app which retrieves .json files with jQuery and injects data into the webpage it is embedded in.

The .json files are encoded with UTF-8 and contain accented chars like é, ö and å.

The problem is that I don't control the charset on the pages that are going to use the app.

Some will be using UTF-8, but others will be using the ISO-8859-1 charset. This will of course garble the special chars from the .json files.

How do I convert special UTF-8 chars to their ISO-8859-1 equivalents using JavaScript?

wazz
Hobhouse

7 Answers

176

Actually, everything is typically stored as Unicode of some kind internally, but let's not go into that. I'm assuming you're getting the classic "åäö" type strings because you're using ISO-8859-1 as your character encoding. There's a trick you can do to convert those characters. The escape and unescape functions used for encoding and decoding query strings are defined for ISO-8859-1 characters, whereas the newer encodeURIComponent and decodeURIComponent, which do the same thing, are defined for UTF-8 characters.

escape encodes extended ISO-8859-1 characters (Unicode code points U+0080–U+00FF) as %xx (two-digit hex), whereas it encodes code points U+0100 and above as %uxxxx (%u followed by four-digit hex). For example, escape("å") == "%E5" and escape("あ") == "%u3042".

encodeURIComponent percent-encodes extended characters as a UTF-8 byte sequence. For example, encodeURIComponent("å") == "%C3%A5" and encodeURIComponent("あ") == "%E3%81%82".

So you can do:

fixedstring = decodeURIComponent(escape(utfstring));

For example, an incorrectly decoded "å" shows up as "Ã¥". Running escape("Ã¥") == "%C3%A5" gives the two incorrect ISO-8859-1 characters encoded as single bytes, and decodeURIComponent("%C3%A5") == "å" then interprets those two percent-encoded bytes as a UTF-8 sequence.

If you'd need to do the reverse for some reason, that works too:

utfstring = unescape(encodeURIComponent(originalstring));
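
For example, unescape(encodeURIComponent("å")) == "Ã¥", which reproduces the garbled form from the earlier example.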

Is there a way to differentiate between bad UTF-8 strings and ISO-8859-1 strings? Turns out there is. The decodeURIComponent function used above will throw an error if given a malformed encoded sequence. We can use this to detect, with high probability, whether our string is UTF-8 or ISO-8859-1.

var fixedstring;

try {
    // If the string is UTF-8, this will work and not throw an error.
    fixedstring = decodeURIComponent(escape(badstring));
} catch (e) {
    // If it isn't, an error will be thrown, and we can assume that we have an ISO string.
    fixedstring = badstring;
}
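
Wrapped up as a reusable helper (a sketch; the name fixEncoding is just illustrative):

function fixEncoding(badstring) {
    try {
        // Mis-decoded UTF-8 round-trips back to the real string
        return decodeURIComponent(escape(badstring));
    } catch (e) {
        // A genuine ISO-8859-1 string throws above; return it unchanged
        return badstring;
    }
}

fixEncoding("Ã¥");    // "å" (repaired)
fixEncoding("hello"); // "hello" (unchanged)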
Kimon
nitro2k01
    I have referenced your answer on the answer for my own question over here : http://stackoverflow.com/questions/18847191/is-there-a-uniform-method-in-both-php-and-js-to-convert-unicode-characters/18863966#18863966 – hsuk Sep 18 '13 at 04:40
  • @nitro: Does JavaScript consider every UTF-8 char as ISO Latin? – hsuk Sep 18 '13 at 04:41
  • `escape` works on ISO-8859-1 code points and `encodeURIComponent` on UTF-8 byte sequences, as now explained in the answer above. I hope that clears up any questions. – nitro2k01 Sep 18 '13 at 21:35
  • @nitro2k01: I'm getting an error with your suggestion: `Uncaught URIError: URI malformed` – Luis A. Florit Mar 23 '14 at 15:28
  • @LuisA.Florit Try the last snippet. – nitro2k01 Mar 23 '14 at 17:08
  • @nitro2k01 Of course, stupid me. Still, I get strange characters. Please take a look here if you have some time (thanks!): http://stackoverflow.com/questions/22592759/whos-responsible-for-this-wrong-encoding – Luis A. Florit Mar 23 '14 at 18:18
  • The escape function is deprecated! https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape – TheGr8_Nik Oct 30 '14 at 09:56
  • You're awesome, you know it, +10 and a fave :v – Hackerman May 19 '16 at 14:43
  • What about UTF-8 to ISO-8859-9? – Wasim A. Feb 11 '17 at 08:22
  • Thank you for the answer. Helpful. – Gary Jul 19 '17 at 13:11
  • You just saved me a lot of time... Was losing my mind reading about RFC and what not – Eyewritecode Apr 28 '21 at 09:33
    @Eyewritecode I'm glad that I could help, but I feel sad that we still need this hack 10 years later... – nitro2k01 Apr 29 '21 at 03:10
11

The problem is that once the page is served up, the content is going to be in the encoding described in the Content-Type meta tag. Content in the "wrong" encoding is already garbled.

You're best to do this on the server before serving up the page. Or, as I have been known to say: UTF-8 end-to-end or die.
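
For instance, a minimal sketch of the idea with Node's built-in http module (the handler and port are illustrative):

const http = require('http');

http.createServer((req, res) => {
    // Declare UTF-8 explicitly so browsers decode é, ö and å correctly
    res.setHeader('Content-Type', 'text/html; charset=utf-8');
    res.end('<p>é ö å</p>');
}).listen(8080);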

Diodeus - James MacFarlane
  • Though my page header already says it's UTF-8, I had to convert it to ISO Latin for further encryption. http://stackoverflow.com/questions/18786025/mcrypt-js-encryption-value-is-different-than-that-produced-by-php-mcrypt-mcryp – hsuk Sep 18 '13 at 04:52
  • that does not answer the question! – Remigius Stalder Apr 03 '21 at 11:43
5

Since the question on how to convert from ISO-8859-1 to UTF-8 was closed as a duplicate of this one, I'm going to post my solution here.

The problem is that when you GET anything using XMLHttpRequest, if the XMLHttpRequest.responseType is "text" or empty, the XMLHttpRequest.response is transformed into a DOMString, and that's where things break: afterwards it's almost impossible to reliably work with that string.

Now, if the content from the server is ISO-8859-1, you'll have to force the response to be of type "blob" and later convert it to a DOMString. For example:

var ajax = new XMLHttpRequest();
ajax.open('GET', url, true);
ajax.responseType = 'blob';
ajax.onreadystatechange = function () {
    if (ajax.readyState === 4 && ajax.status === 200) {
        // Convert the blob to a string
        var reader = new window.FileReader();
        reader.addEventListener('loadend', function () {
            // For ISO-8859-1 there's no further conversion required;
            // reader.result is the string to use (resolve it, pass it to a callback, etc.)
            Promise.resolve(reader.result);
        });
        reader.readAsBinaryString(ajax.response);
    }
};
ajax.send();

Seems like the magic is happening in readAsBinaryString: it maps each byte of the blob to the string code unit with the same numeric value, and since ISO-8859-1 maps bytes 0x00–0xFF one-to-one onto code points U+0000–U+00FF, the resulting string comes out correctly decoded.
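
For reference, modern browsers also expose TextDecoder, which can decode ISO-8859-1 bytes directly and skips the FileReader detour entirely; a sketch of that alternative (assuming the server really does send ISO-8859-1):

fetch(url)
    .then((response) => response.arrayBuffer())
    .then((buffer) => {
        // Decode the raw bytes with an explicit charset
        const text = new TextDecoder('iso-8859-1').decode(buffer);
        // `text` is now a correctly decoded string
    });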

Eldelshell
1

Internally, JavaScript strings are all Unicode: sequences of UTF-16 code units (historically described as UCS-2).

If you're retrieving the JSON files separately via AJAX, then you only need to make sure that the JSON files are served with the correct Content-Type and charset: Content-Type: application/json; charset="utf-8". If you do that, jQuery should already have interpreted them properly by the time you access the deserialized objects.
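
For example, with that header in place a plain jQuery request should just work; a sketch (the url, #target element and data.name field are hypothetical):

$.getJSON(url, function (data) {
    // data is already a correctly decoded object at this point
    $('#target').text(data.name); // hypothetical element and field
});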

Could you post an example of the code you’re using to retrieve the JSON objects?

Martijn
  • It is irrelevant whether you set only the content-type or also the charset: jQuery interprets the served JSON exactly the same way. Probably because the spec (http://www.ietf.org/rfc/rfc4627.txt) says that `JSON text SHALL be encoded in Unicode. The default encoding is UTF-8`. So setting the header to `Content-Type: application/json; charset="iso-8859-1"` after JSON-encoding text read from a file encoded in ISO-8859-1, and sending it by AJAX to an ISO-8859-1-encoded HTML page, produces the same result as not specifying anything: the strings are interpreted by the browser as `NULL` – Pere May 29 '15 at 10:57
1

There are libraries that do charset conversion in JavaScript. But if you want something simple, this function does approximately what you want:

function stringToBytes(text) {
  const length = text.length;
  const result = new Uint8Array(length);
  for (let i = 0; i < length; i++) {
    const code = text.charCodeAt(i);
    // Code points above 255 can't be represented in ISO-8859-1,
    // so replace them with a space (0x20)
    const byte = code > 255 ? 32 : code;
    result[i] = byte;
  }
  return result;
}

If you want to convert the resulting byte array into a Blob, you would do something like this:

const originalString = 'ååå';
const bytes = stringToBytes(originalString);
const blob = new Blob([bytes.buffer], { type: 'text/plain; charset=ISO-8859-1' });

Now, keep in mind that some apps do accept UTF-8 encoding, but they can't guess the encoding unless you prepend a BOM character, as explained here.
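
For instance, a UTF-8 BOM is just the character U+FEFF prepended to the text, so a BOM-carrying blob could be built like this (a sketch):

const utf8Blob = new Blob(['\uFEFF' + originalString], { type: 'text/plain; charset=UTF-8' });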

Jose Solorzano
0

As escape is deprecated (and didn't actually work for me), I used a small library for the encoding. I went with a library called iso-8859-15. Note that ISO-8859-15 differs from ISO-8859-1 in only a few characters (comparison), and chances are your input is actually ISO-8859-15 rather than ISO-8859-1.

import {encode} from 'iso-8859-15';

const encodedBytes = new Uint8Array(encode(unicodeString))
const blob = new Blob([encodedBytes])
marcelj
-4

You should add this line to the <head> of your page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Mark Rotteveel