I'm having an encoding problem related to cookies on one of my websites.

A user is entering Usuário, which has an acute accent, and that value is being put in a cookie. The raw hex of the cookie in the response (for the Usuário string) is:

55 73 75 C3 A1 72 69 6F

When I see it in the browser, it looks like this:

[image: the cookie value as displayed in the browser, garbled]

...which is really messy. I need to fix this up.

Then I went to this website: http://www.rapidtables.com/convert/number/hex-to-ascii.htm and converted the hex value to see what it would look like. I got the same output:

[image: converter output showing the same garbled text]

Right. This means the hex code is wrong. Then I tried to convert Usuário to ASCII to see what it should be. I used this website: http://www.asciitohex.com/ and this is the result:

[image: converter output showing the hex for Usuário]

To my surprise, the hex is exactly the one that shows up garbled. Why?

And how do I represent Usuário in ASCII so I can put it in a cookie? Should I manually encode it?

PS: I'm using ASP.NET, just in case it matters.

Andre Pena
  • ASCII does not support accented characters. You might try some local character set (like cp850), but it's better to set all your environments to UTF8 and force the client's browser to be UTF8 as well with the proper meta tags and headers – SztupY Mar 24 '15 at 23:46
  • note that the hex representation you posted is in utf8 and not ascii. – SztupY Mar 24 '15 at 23:47
  • Thanks @SztupY, but how do you know it's UTF-8? – Andre Pena Mar 24 '15 at 23:50
  • Because the other Latin characters are a single byte each (8 bits), which conforms to UTF-8. UTF-8 uses variable-length encoding, with a minimum length of 1 byte. UTF-16, for example, has a minimum encoding length of 2 bytes (16 bits), but only the accented character is 2 bytes long. I recommend reading this: http://www.joelonsoftware.com/articles/Unicode.html – Reticulated Spline Mar 25 '15 at 01:46

1 Answer

As of 2015, the standard way to store character data on the web is UTF-8, not ASCII. ASCII defines only 128 characters (codes 0 to 127) and does not include any accented characters. To add accented characters to these 128, there were many legacy solutions: codepages. Each codepage added up to 128 more characters on top of the ASCII set, allowing 256 different characters to be represented in a single byte.

The problem was that this didn't really solve the issue: ASCII-based codepages were more or less incompatible with each other (except for the first 128 characters), and there was usually no way of knowing programmatically which codepage was in use.
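To make the incompatibility concrete, here is a small Python sketch (Python is my choice for illustration; the question is about ASP.NET, and the helper names are made up). It decodes one and the same byte under three single-byte codepages and checks whether a string fits in plain ASCII.

```python
# One byte value, three legacy codepages, three different characters.
RAW = bytes([0xE1])


def decode_in_codepages(raw, codepages=("iso-8859-1", "cp850", "iso-8859-5")):
    """Decode the same bytes under several single-byte codepages."""
    return {name: raw.decode(name) for name in codepages}


def fits_in_ascii(text):
    """Return True if every character of text exists in 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# The byte itself carries no hint about which codepage was meant.
decoded = decode_in_codepages(RAW)
```

Only the ISO-8859-1 entry of `decoded` is á; the other two codepages map 0xE1 to unrelated characters. `fits_in_ascii("Usuário")` is False, which is why there is no ASCII representation of that string to look up.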

One of the solutions was UTF-8, which is a way to encode the Unicode character set (containing most of the characters used around the world, and more) while remaining compatible with ASCII. The first 128 characters are encoded identically in both, but beyond that UTF-8 characters become multi-byte: one character is encoded as a sequence of two to four bytes, depending on the character.
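The variable-length behaviour is easy to check for yourself. A minimal Python sketch (illustrative only; `utf8_hex` and `utf8_length` are hypothetical helper names) that prints the UTF-8 bytes of a string as hex and counts the bytes per character:

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


def utf8_length(char):
    """Return how many bytes UTF-8 needs for a single character."""
    return len(char.encode("utf-8"))


# ASCII letters take 1 byte, á takes 2, the euro sign 3, an emoji 4.
lengths = {ch: utf8_length(ch) for ch in ("U", "\u00e1", "\u20ac", "\U0001F600")}
hex_of_usuario = utf8_hex("Usu\u00e1rio")
```

`hex_of_usuario` comes out as 55 73 75 C3 A1 72 69 6F, the same bytes as in the question, so the cookie bytes are valid UTF-8.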

The problem arises if you are using some kind of ASCII-based single-byte codepage (like ISO-8859-1), which encodes every supported character in one byte, but your input is actually UTF-8, which encodes accented characters in multiple bytes. You can see this in your hex example: á is encoded as C3 A1, two bytes. If you read these two bytes with a single-byte codepage (in Western Europe this is usually ISO-8859-1), each of the two bytes is displayed as a separate character, so á shows up as Ã¡.
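The effect can be reproduced with the exact bytes from the question. This is a Python sketch for illustration, not what ASP.NET or the browser does internally; the variable and function names are my own:

```python
# The cookie bytes from the question.
COOKIE_BYTES = bytes.fromhex("55 73 75 C3 A1 72 69 6F")

# Read as UTF-8: C3 A1 is one character (á).
correct = COOKIE_BYTES.decode("utf-8")

# Read as ISO-8859-1: C3 and A1 are two separate characters.
garbled = COOKIE_BYTES.decode("iso-8859-1")


def repair(mojibake):
    """Undo a UTF-8-read-as-ISO-8859-1 mix-up by reversing the wrong decode."""
    return mojibake.encode("iso-8859-1").decode("utf-8")
```

`correct` has 7 characters and `garbled` has 8, because the single á became two characters. `repair(garbled)` returns the original string, which shows the bytes were never damaged, only misread.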

On the web the default encoding is UTF-8, so your clients will usually send their requests as UTF-8. ASP.NET is Unicode-aware, so it can handle these requests. However, somewhere in your code this UTF-8 is accidentally interpreted as ISO-8859-1 and then converted back into UTF-8. This can happen at various layers. Given your symptoms it probably happens at the cookie layer, which is sometimes problematic (here is how it worked in 2009). You should also double-check that the rest of your application uses UTF-8 everywhere (views, database, etc.) if you want to properly support accented characters.
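As for whether to encode the value manually: one common way to sidestep the cookie layer, and this is my suggestion rather than something established above, is to percent-encode the UTF-8 bytes so the cookie value contains only ASCII, and decode it when reading it back. A minimal sketch in Python with hypothetical helper names; in ASP.NET the rough equivalents would be HttpUtility.UrlEncode / HttpUtility.UrlDecode or Uri.EscapeDataString / Uri.UnescapeDataString:

```python
from urllib.parse import quote, unquote


def encode_cookie_value(text):
    """Percent-encode the UTF-8 bytes of text so the result is pure ASCII."""
    return quote(text, safe="")


def decode_cookie_value(value):
    """Reverse encode_cookie_value, interpreting the bytes as UTF-8."""
    return unquote(value)


cookie_value = encode_cookie_value("Usu\u00e1rio")
round_tripped = decode_cookie_value(cookie_value)
```

Here `cookie_value` is Usu%C3%A1rio, which survives any single-byte codepage on the way, and `round_tripped` is the original string again.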

SztupY