
I've been revisiting the question of why some kinds of character data get corrupted when sent via an Ajax call to a web server, no matter what encoding is used. Even if the data is pre-encoded into a 7-bit format, what comes out is still not always equal to what went in.

I was using a third-party JavaScript Base64 encoder to prepare the Ajax data, and initially thought this had a bug. But other Base64 encoders show exactly the same problem (including one which claims full Unicode compatibility), and there are several forum reports of similar problems, none of which seem to have been fully resolved. So I don't think the encoder itself is at fault.
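(For reference, the kind of encoding step involved is sketched below. This is the widely circulated UTF-8 round trip, assuming the browser's native btoa/atob are available; the function names are just illustrative.)

// Sketch only: force a UTF-8 round trip around native Base64.
// encodeURIComponent emits %XX UTF-8 escapes; unescape collapses them to
// single 0x00-0xFF "bytes" that btoa can accept.
function base64EncodeUtf8(str) {
  return btoa(unescape(encodeURIComponent(str)));
}

function base64DecodeUtf8(b64) {
  return decodeURIComponent(escape(atob(b64)));
}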

I noticed that the corruption typically arises with data cut-and-pasted from other programs into CKEditor, if that data contains certain specific high-order ASCII/ANSI codes.

A few more tests seem to indicate that the problem has to do with some kind of discrepancy between the way JavaScript reads character data from a webpage and the way it forms string data from internal programmatic methods, for example String.fromCharCode().

In the snippet below, the handling of character 0x9E inserted into an HTML document by cut-and-paste from a text editor is compared with the same character generated programmatically from hex code 0x9E (Latin small letter z with caron, U+017E, which sits at 0x9E in the Windows Western charset). This is one of several character codes which have been seen to give rise to this anomalous behaviour. Strangely, most other >127 character codes give no such problems, and are rendered as two-byte Unicode as they should be.

<script>
  function charcodes(invar) {
    // Lists the charCodeAt() value for each index (UTF-16 code unit) in the string.
    var ccodes = "~";
    for (var ct = 0; ct < invar.length; ct++) {
      ccodes += invar.charCodeAt(ct) + "~";
    }
    return ccodes;
  }

  var pasted_char = 'ž';
  alert("Pasted Character: " + pasted_char + " Resultant Code(s): " + charcodes(pasted_char));

  var charcode = 0x9E;
  var generated_char = String.fromCharCode(charcode);
  alert("Generated Character: " + generated_char + " Resultant Code(s): " + charcodes(generated_char));
</script>

With a UTF-8 page charset, this gives:

Pasted Character: [0xFFFD] Resultant Code(s): ~65533~

Generated Character: [blank] Resultant Code(s): ~158~

With the default page charset, this gives:

Pasted Character: ž Resultant Code(s): ~382~

Generated Character: [blank] Resultant Code(s): ~158~

Notably, neither handling of the pasted character is correct, and there is no such ANSI code as 382!

Both outputs are single byte.

Strictly speaking this character is 8-bit ASCII/ANSI, which JS does not claim to handle. However, it is perfectly legitimate for it to be pasted into an HTML editor, for example from a text document, so the JavaScript subsystem should be capable of handling such input without bugs arising. It certainly seems to me, anyway, that generating the same character string in two different ways should not return two different results.
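For reference, dumping the same two results in hex (a small illustrative addition to the test above; the variable names are mine) shows the mismatch directly:

var pasted = 'ž';                          // cut-and-pasted character
var generated = String.fromCharCode(0x9E); // programmatically generated
alert(pasted.charCodeAt(0).toString(16));    // "17e" (382) with the default page charset
alert(generated.charCodeAt(0).toString(16)); // "9e"  (158)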

Any thoughts on this would be welcome. I am not sure exactly what role this anomaly plays in corrupting the Ajax send, but it seems likely that it is the culprit.

1 Answer


All strings in JavaScript are UTF-16 (and occasionally its precursor UCS-2), regardless of the character encoding of the page. This is stated in section 8.4 of the ES5 specification, and section 8.5 in ES3. For common characters such as a-z this makes little difference whether you want ANSI or UTF-8 codes, because the values are the same, but this is not true for all characters.
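For example (values as they would appear in a browser console; the ž here is the pasted character from the question):

'ž'.length;                        // 1   - one UTF-16 code unit, not one byte
'ž'.charCodeAt(0);                 // 382 - 0x17E, its UTF-16/Unicode value
String.fromCharCode(382) === 'ž';  // true
String.fromCharCode(0x9E);         // U+009E, an (invisible) C1 control character - not ž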

If you want to generate ANSI, you will need a 256-item dictionary or some other logic for the character mappings.


Here is such a table (without control chars):

var ANSI = {
    " ": 32,
    "!": 33,
    "\"": 34,
    "#": 35,
    "$": 36,
    "%": 37,
    "&": 38,
    "'": 39,
    "(": 40,
    ")": 41,
    "*": 42,
    "+": 43,
    ",": 44,
    "-": 45,
    ".": 46,
    "/": 47,
    "0": 48,
    "1": 49,
    "2": 50,
    "3": 51,
    "4": 52,
    "5": 53,
    "6": 54,
    "7": 55,
    "8": 56,
    "9": 57,
    ":": 58,
    ";": 59,
    "<": 60,
    "=": 61,
    ">": 62,
    "?": 63,
    "@": 64,
    "A": 65,
    "B": 66,
    "C": 67,
    "D": 68,
    "E": 69,
    "F": 70,
    "G": 71,
    "H": 72,
    "I": 73,
    "J": 74,
    "K": 75,
    "L": 76,
    "M": 77,
    "N": 78,
    "O": 79,
    "P": 80,
    "Q": 81,
    "R": 82,
    "S": 83,
    "T": 84,
    "U": 85,
    "V": 86,
    "W": 87,
    "X": 88,
    "Y": 89,
    "Z": 90,
    "[": 91,
    "\\": 92,
    "]": 93,
    "^": 94,
    "_": 95,
    "`": 96,
    "a": 97,
    "b": 98,
    "c": 99,
    "d": 100,
    "e": 101,
    "f": 102,
    "g": 103,
    "h": 104,
    "i": 105,
    "j": 106,
    "k": 107,
    "l": 108,
    "m": 109,
    "n": 110,
    "o": 111,
    "p": 112,
    "q": 113,
    "r": 114,
    "s": 115,
    "t": 116,
    "u": 117,
    "v": 118,
    "w": 119,
    "x": 120,
    "y": 121,
    "z": 122,
    "{": 123,
    "|": 124,
    "}": 125,
    "~": 126,
    " ": 127,
    "€": 128,
    " ": 129,
    "‚": 130,
    "ƒ": 131,
    "„": 132,
    "…": 133,
    "†": 134,
    "‡": 135,
    "ˆ": 136,
    "‰": 137,
    "Š": 138,
    "‹": 139,
    "Œ": 140,
    " ": 141,
    "Ž": 142,
    "«": 143,
    " ": 144,
    "‘": 145,
    "’": 146,
    "“": 147,
    "”": 148,
    "•": 149,
    "–": 150,
    "—": 151,
    "˜": 152,
    "™": 153,
    "š": 154,
    "›": 155,
    "œ": 156,
    " ": 157,
    "ž": 158,
    "Ÿ": 159,
    " ": 160,
    "¡": 161,
    "¢": 162,
    "£": 163,
    "¤": 164,
    "¥": 165,
    "¦": 166,
    "§": 167,
    "¨": 168,
    "©": 169,
    "ª": 170,
    "«": 171,
    "¬": 172,
    "­": 173,
    "®": 174,
    "¯": 175,
    "°": 176,
    "±": 177,
    "²": 178,
    "³": 179,
    "´": 180,
    "µ": 181,
    "¶": 182,
    "·": 183,
    "¸": 184,
    "¹": 185,
    "º": 186,
    "»": 187,
    "¼": 188,
    "½": 189,
    "¾": 190,
    "¿": 191,
    "À": 192,
    "Á": 193,
    "Â": 194,
    "Ã": 195,
    "Ä": 196,
    "Å": 197,
    "Æ": 198,
    "Ç": 199,
    "È": 200,
    "É": 201,
    "Ê": 202,
    "Ë": 203,
    "Ì": 204,
    "Í": 205,
    "Î": 206,
    "Ï": 207,
    "Ð": 208,
    "Ñ": 209,
    "Ò": 210,
    "Ó": 211,
    "Ô": 212,
    "Õ": 213,
    "Ö": 214,
    "×": 215,
    "Ø": 216,
    "Ù": 217,
    "Ú": 218,
    "Û": 219,
    "Ü": 220,
    "Ý": 221,
    "Þ": 222,
    "ß": 223,
    "à": 224,
    "á": 225,
    "â": 226,
    "ã": 227,
    "ä": 228,
    "å": 229,
    "æ": 230,
    "ç": 231,
    "è": 232,
    "é": 233,
    "ê": 234,
    "ë": 235,
    "ì": 236,
    "í": 237,
    "î": 238,
    "ï": 239,
    "ð": 240,
    "ñ": 241,
    "ò": 242,
    "ó": 243,
    "ô": 244,
    "õ": 245,
    "ö": 246,
    "÷": 247,
    "ø": 248,
    "ù": 249,
    "ú": 250,
    "û": 251,
    "ü": 252,
    "ý": 253,
    "þ": 254,
    "ÿ": 255
};
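Once the table has been cleaned up (see the note below about characters that didn't paste correctly), it could be used along these lines. This is only a sketch, and str2ansi is just an illustrative name; characters with no entry in the table are skipped here:

// Sketch: map a JavaScript (UTF-16) string onto ANSI code values via the
// ANSI dictionary above. Characters not present in the table are dropped.
function str2ansi(str) {
    var out = [], i, code;
    for (i = 0; i < str.length; ++i) {
        code = ANSI[str.charAt(i)];
        if (code !== undefined) {
            out.push(code);
        }
    }
    return out;
}

// str2ansi('Až'); // [65, 158]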

I generated this with the following code applied to this page, then copied and pasted it here with some very minor modifications (escapes on \ and "), so you'll notice some characters didn't cross properly (notably the different types of space) and may need to be removed or modified before you can use it. You might also want to switch to the encoding-safe \uXXXX format for the keys.

// Scrape the code table on this page: step through the <td> cells in a
// 7-column pattern, take the trailing character of each cell as the key and
// the leading digits as the decimal code, then build a paste-ready object literal.
var cells = document.getElementsByTagName('table')[0].getElementsByTagName('td'),
    a = [], i, j, k, v;
for (j = 0; j < 7; ++j) for (i = 7 + j; i < cells.length; i += 7) {
    k = cells[i].textContent.slice(-1);
    v = +cells[i].textContent.slice(0, 3).replace(/[^\d]/g, '');
    a.push('    "' + k + '": ' + v);
}
'{\n' + a.join(',\n') + '\n}';
Paul S.
  • I think you are missing the point, which is that javascript handles certain specific characters in cut-and-pasted text in an anomalous manner, and outputs nonsense instead of the correct UTF codes. Regardless of character-set, this should not happen. –  Sep 27 '13 at 07:35
  • @IanR What I'm saying is: once interpreted by the browser, the _ANSI_ `ž` at `142` becomes the _UTF-16_ `ž` at `382`. As soon as you paste it, it immediately becomes `382` - that it was ever `142` gets forgotten. `String.fromCharCode(382); // "ž"` – Paul S. Sep 27 '13 at 10:37
  • @IanR edited in a dictionary such as that which I spoke about creating, but please read comments about it – Paul S. Sep 27 '13 at 11:08
  • I do not understand how a single byte character code can have a value of 382. Surely, a UTF character should consist of one or more values from 0 to 0xFF? charCodeAt(ct) appears to correctly return two or four bytes on other Unicode data. –  Sep 29 '13 at 09:16
  • S:"The point is, that in UTF-16, ž is not a single byte" -I am aware of that. Unfortunately, being aware of that does not solve the problem, of base64 encoding of some specific chars giving rise to erroneous data. It looks like the problem can be worked around by hex-string encoding the entire data before base64 encoding it, but that is extremely inefficient. I'm wondering if these (relatively few and obscure) problem characters can be detected reliably, if so simply stripping them would be acceptable. –  Sep 29 '13 at 21:54
  • `/[^\u0000-\u00ff]/g.test('ž'); // true` ? (`/[^\u0000-\u00ff]/g.test('a'); // false`) – Paul S. Sep 29 '13 at 22:00
  • I can give that a try but I don't see why it would work when examining individual bytes does not always return the correct charcode. GIGO is the problem. –  Sep 30 '13 at 10:26
  • `1 char` is not `1 byte`. `ž` is `1 char` but `2 bytes`. `ž` is represented by the bytes `7E 01` - the number `0x17E === 382` - so can be called `U+017E`. http://www.fileformat.info/info/unicode/char/17e/index.htm – Paul S. Sep 30 '13 at 13:37