jQuery.ajax() is doing something weird when escaping my data.

For example, if I send the request:

$.ajax({
    url: 'somethinguninteresting',
    data: {
        name: 'Ihave¬aweirdcharacter'
    }
});

then investigate the XHR in Chrome devtools, it shows the "Request Payload" as name=Ihave%C2%ACaweirdcharacter

Now, I've figured out that:

'¬'.charCodeAt(0) === 172

and that 172 is AC in hexadecimal.

Working backwards, C2 (the "extra" character being prepended) in hexadecimal is 194 in decimal, and

String.fromCharCode(194) === 'Â'

My Question:

Why does

encodeURIComponent('¬')

return '%C2%AC', which would appear to be the result of calling

encodeURIComponent('Â¬')

(which itself returns '%C3%82%C2%AC')?

Alex McMillan

2 Answers

Although JavaScript uses UTF-16 (or UCS-2) internally, it performs URI encoding based on UTF-8.

The code point 172 (U+00AC) lies outside the ASCII range (0 to 127), so UTF-8 encodes it in two bytes. The two-byte form, which covers code points 128 to 2047, looks like this:

110xxxxx 10xxxxxx

We fill the eleven x positions with the binary representation of 172, which is 10101100, padded on the left with zeros:

11000010 10101100 = C2AC
   ^^^
   pad

This outcome is then percent encoded to finally form %C2%AC which is what you saw in the request payload.
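
The bit-filling steps above can be sketched in code. This is only an illustration of the two-byte case, not how any engine implements encodeURIComponent; the helper names encodeTwoByteUtf8 and percentEncode are made up for this example.

```javascript
// Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
// encodeTwoByteUtf8 and percentEncode are illustrative names, not built-ins.
function encodeTwoByteUtf8(codePoint) {
  if (codePoint < 0x80 || codePoint > 0x7ff) {
    throw new RangeError('only the two-byte range U+0080..U+07FF is handled');
  }
  const lead = 0xc0 | (codePoint >> 6);           // 110xxxxx: upper 5 bits
  const continuation = 0x80 | (codePoint & 0x3f); // 10xxxxxx: lower 6 bits
  return [lead, continuation];
}

// Turn each byte into a %XX escape with uppercase hex digits.
function percentEncode(bytes) {
  return bytes
    .map((b) => '%' + b.toString(16).toUpperCase().padStart(2, '0'))
    .join('');
}

const bytes = encodeTwoByteUtf8('¬'.charCodeAt(0)); // code point 172
const encoded = percentEncode(bytes);
console.log(encoded);                               // %C2%AC
console.log(encoded === encodeURIComponent('¬'));   // true
```

The second byte comes out as AC only because 172 fits in the lower six bits plus the fixed 10 prefix; for larger code points the bytes no longer resemble the code point.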

Ja͢ck
  • Aaah - so it's actually just a -coincidence- that the second byte happens to be 172 in binary! That was *really* throwing me off. Thank you for your explanation. Am I correct in thinking that your "^^^ pad" is off by one character? – Alex McMillan Nov 26 '14 at 11:14
  • @AlexMcMillan Eh yeah, it's off .. ascii art mistake ;-) – Ja͢ck Nov 26 '14 at 11:16
  • Very nice, far easier to understand when it's described visually like that. Thank you – Alex McMillan Nov 26 '14 at 11:19

URL encoding (or percent-encoding) encodes Unicode characters using UTF-8. UTF-8 encodes characters with varying numbers of bytes. The ¬ character is encoded in UTF-8 as C2 AC.

The charCodeAt method returns a UTF-16 code unit, not UTF-8 bytes, so it does not show the multi-byte sequence. See this answer https://stackoverflow.com/a/18729931/4231110 for more details on how to use charCodeAt to encode a string with UTF-8.
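
To see the difference between the code unit and the UTF-8 bytes side by side, one option is TextEncoder, which is available in current browsers and in Node.js. This is a small sketch rather than a replacement for the linked answer; the variable names are arbitrary.

```javascript
const ch = '¬';

// charCodeAt returns the UTF-16 code unit, which for this character
// equals the Unicode code point U+00AC.
const codeUnit = ch.charCodeAt(0); // 172

// TextEncoder always produces UTF-8, so these are the bytes that
// percent-encoding works from.
const utf8Bytes = Array.from(new TextEncoder().encode(ch)); // [194, 172]

// Show the bytes as two-digit uppercase hex.
const hex = utf8Bytes.map((b) => b.toString(16).toUpperCase().padStart(2, '0'));
console.log(codeUnit, hex.join(' ')); // 172 C2 AC
```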

In short, %C2%AC is the correct percent encoding of ¬. This can be demonstrated by running

decodeURIComponent('%C2%AC') // '¬'
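
The same round trip can be checked for the whole payload. The sketch below uses URLSearchParams as a stand-in for whatever builds and parses the form-encoded data; it is not what jQuery calls internally, and it assumes the receiving side decodes percent-escapes as UTF-8.

```javascript
const original = 'Ihave¬aweirdcharacter';

// Serialize the way a form-encoded body or query string would look.
const payload = new URLSearchParams({ name: original }).toString();
console.log(payload); // name=Ihave%C2%ACaweirdcharacter

// Parse it back, as a receiver that decodes UTF-8 would.
const received = new URLSearchParams(payload).get('name');
console.log(received === original); // true
```

The value survives unchanged, which is why nothing special needs to be done on the sending side.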
Justin Howard
  • `C2 AC` is hexadecimal or base16 encoded, not UTF-8. – Alexander O'Mara Nov 26 '14 at 05:41
  • @AlexanderO'Mara I don't follow. The unicode character is [U+00AC](http://www.fileformat.info/info/unicode/char/ac/index.htm) which is encoded in UTF-8 as the hexadecimal string C2AC, or the binary sequence 1100001010101100 if you prefer that. – Justin Howard Nov 26 '14 at 05:49
  • The transformation of `¬` to `%C2%AC` is technically base16 encoding. `C2 AC` is the hexadecimal representation of the `¬` in UTF-8 encoding. – Alexander O'Mara Nov 26 '14 at 06:00
  • `%C2%AC` is the direct UTF-8 encoding of `¬` http://www.w3schools.com/tags/ref_urlencode.asp I don't understand why it is confusing or `weird` – shaN Nov 26 '14 at 06:12
  • That seems like a *massive* amount of mucking around to send a plain text string from a browser to server via AJAX, the likes of which don't appear in any similar javascript I've seen/read. What value would I send as `data.name` in the above example to get the expected result (ie sending the string correctly)? – Alex McMillan Nov 26 '14 at 06:37
  • You don't need to do anything at all. The point is that `%C2%AC` *is* the correct encoding. – Justin Howard Nov 26 '14 at 06:39