C++ character encoding UTF-8

Question

I've got the following code, which converts Unicode to the appropriate character. For example, when a user enters úsername into the browser, %FAsername is returned to the code, which then converts it back to úsername.

However, when the browser encoding is set to UTF-8, the value passed to the code is %C3%BAsername, which is then converted to Ãºsername. That is the wrong value for authentication. How can I modify the code to make it UTF-8 compatible?

C3 BA *is* the UTF-8 encoding of `ú`. What do you want to have happen, exactly? (http://www.fileformat.info/info/unicode/char/fa/index.htm) — Carl Norum, Sep 03 '13 at 15:17
I want C3 BA to be converted to ú, at the moment the result of the conversion is Ãº @CarlNorum — user2724841, Sep 03 '13 at 15:18
If you're sending the C3 BA, you're already sending UTF-8. You need to figure out what's wrong on the *displaying* end, then. — Carl Norum, Sep 03 '13 at 15:19
Yes, I understand I am sending UTF8 but how do I convert C3 BA to ú ? @CarlNorum — user2724841, Sep 03 '13 at 15:21
Maybe I'm misunderstanding; you mean you want to get C3 BA and convert it to FA? Just get the right bits out of the stream: http://en.wikipedia.org/wiki/UTF-8 — Carl Norum, Sep 03 '13 at 15:22
The stream is from a browser encoded to UTF-8, so the string received will always be C3 BA. How can this be converted to FA? @CarlNorum — user2724841, Sep 03 '13 at 15:29
@user2724841 You've asked this question a couple of times before, and you accepted this answer: http://stackoverflow.com/questions/18534494/convert-from-utf-8-to-unicode-c. I also gave you an answer another time. I think you should say what was wrong for you with the previous answers, otherwise you're just going to get answers that you can't work with again. In general terms the answer will always be the same: you write some code that follows the rules for UTF-8 decoding. — john, Sep 03 '13 at 15:32
I think your question above shows a misunderstanding, the conversion '%C3%BAsername' to 'Ãºsername' is correct. 'Ãºsername' is a UTF-8 encoded string. The conversion to 'úsername' happens later. So you don't need to modify the code above at all, you need to add some code **afterwards**. Some code similar to what has been suggested before. — john, Sep 03 '13 at 15:40

Joop Eggen · Answer 1 · 2013-09-03T15:52:57.943

2

No answer

There are a couple of things slightly wrong. ú has Unicode code point U+00FA, or as we developers say: 0x00FA. The Unicode code space has 17x2^16 code points (U+0000 to U+10FFFF). UTF-8 uses multi-byte sequences. For 7-bit pure ASCII, the UTF-8 bytes are identical to ASCII. For U+00FA, however, more than one byte is needed.

%C3%BA seems correct, as each %XX is one URL-encoded byte. For U+0109, ĉ, a single byte like %FA would not do, because the code point does not fit into one byte.
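To see where the bytes C3 BA come from, the encoding rule can be worked through in code. This is a minimal sketch, not the code from the question: `utf8_encode` is a hypothetical helper name, and it assumes the input is a valid code point (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid code point; no validation is done in this sketch.
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                 // 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 11110xxx plus three continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `utf8_encode(0xFA)` yields the two bytes C3 BA, and `utf8_encode(0x109)` yields C4 89, which is why ĉ cannot arrive as a single %XX byte.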

Plenty of code snippets exist for decoding UTF-8 to, and encoding it from, a wide-character string.

I am afraid some handling has to change.

Normal procedure

One receives a URL-encoded string, with %XX escapes.

char* url_decode(const char*); // would translate each %XX escape to one char (byte).
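A minimal sketch of such a percent-decoding step might look like the following. It is an illustration, not the question's code: it uses `std::string` instead of the `char*` signature above, `hex_value` is a hypothetical helper, and it assumes malformed escapes should be copied through unchanged. It also does not turn `+` into a space, which form-encoded input would need.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of a hex digit, or -1 if c is not one.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replaces every well-formed %XX escape with the single byte it denotes.
// The result is still a byte string; no character decoding happens here.
std::string url_decode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += in[i];  // literal character, or a malformed escape kept as-is
    }
    return out;
}
```

For the input `%C3%BAsername` this produces the bytes C3 BA followed by `sername`. Shown in a Latin-1 view those two bytes look like Ãº, but they are the correct UTF-8 bytes for ú.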

Now you have a byte stream, arrived as UTF-8: a multi-byte UTF-8 string.

wchar_t* utf8_decode(const char* bytes); // would translate the UTF-8 bytes into wide-character text.

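This resolves the multi-byte sequences into a string of wide characters: UTF-16 code units where wchar_t is 16 bits (as on Windows), UTF-32 code points where it is 32 bits.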
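The decoding step could be sketched as follows. This is an assumption-laden illustration rather than a definitive implementation: it returns a `std::u32string` of code points instead of the `wchar_t*` in the declaration above (to avoid depending on the platform's `wchar_t` width), it throws `std::invalid_argument` on malformed input as one possible error policy, and it does not reject overlong forms, surrogates, or values above U+10FFFF.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes a UTF-8 byte string into Unicode code points.
// Sketch only: overlong encodings and surrogates are not rejected.
std::u32string utf8_decode(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

Given the bytes C3 BA, the lead byte contributes the bits 00011 and the continuation byte contributes 111010, so the result is the single code point 0xFA, which is ú.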

edited Sep 03 '13 at 15:52

answered Sep 03 '13 at 15:27

Joop Eggen


So will I be able to convert the string C3 BA to FA? or directly to ú within the code? – user2724841 Sep 03 '13 at 15:32
That sounds unnecessarily complicated: convert '%C3%BA' to 'Ãº', then convert 'Ãº' to 'ú'. Break a complicated problem down into smaller steps. The first step you already have, the second step is UTF-8 decoding. – john Sep 03 '13 at 15:46
Yes UTF-8 decoding would convert `{ (char)0xC3, (char)0xBA }` to code 0xFA, to wide char ú. – Joop Eggen Sep 03 '13 at 15:47
I have made my answer a bit more concrete. – Joop Eggen Sep 03 '13 at 15:54
