
I have read a lot about Unicode, ASCII, code pages, all the history, the invention of UTF-8, UTF-16 (UCS-2), UTF-32 (UCS-4), who uses them, and so on, but I still have some questions that I tried hard to find answers to and couldn't, and I hope you can help me.

1 - Unicode is a standard for encoding characters, and it specifies a code point for each character, something like U+0000 (for example). Imagine that I have a file that contains those code points (\u0000): at which point in my application am I going to use them?

This might be a silly question, but I really don't know at which point in my application I'm going to use them. I'm creating an application that can read a file containing those code points using the \u escape, and I know that I can read and decode it, but now the next question.

2 - To which character set (code page) do I need to convert it? I saw some C++ libraries that use names like utf8_to_unicode or utf8-to-utf16, and sometimes just utf8_decode, and this is what confuses me.

I don't know if answers like this will appear, but some might say: you need to convert it into the code page you are going to use. But what if my application needs to be internationalized?

3 - I was wondering: in C++, if I try to display non-ASCII characters on the terminal, I get some garbled characters. The question is: is it the fonts that determine how the characters are displayed?

#include <iostream>

int main()
{
    std::cout << "ö" << std::endl;

    return 0;
}

The output (Windows):

├Â

4 - At which part of that process does the encoding come in? Does it encode, take the code point, and try to find the matching glyph in the font?

5 - WebKit is an engine for rendering web pages in web browsers. If you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, no matter which font I'm using. What happens?

<html>
<head>
    <meta charset="iso-8859-1"> 
</head>
<body>
    <p>ö</p>
</body>
</html>

The output:

Ã¶

Works using:

<meta charset="utf-8">

6 - Imagine now that I read the file, I decode it, I have all the code points, and I need to save the file again. Do I need to save it with the escapes (\u0000), or do I need to convert them back into characters first and then save?

7 - Why is the word "unicode" a bit overloaded and sometimes understood to mean UTF-16? (source)

That's all for now. Thanks in advance.

SH.0x90
  • characters is ambiguous: It is often used for code-unit, but properly refers to code-point or even grapheme. Use a less ambiguous term here. – Deduplicator Jun 30 '14 at 18:24
  • Looks like you want a complete primer on Unicode. Did you read the wikipedia pages? – Deduplicator Jun 30 '14 at 18:26
  • Yes, I did and I have those questions. – SH.0x90 Jun 30 '14 at 18:29
  • Related reading (should also answer point 7, if not all of them): http://www.utf8everywhere.org – Deduplicator Jun 30 '14 at 18:35
  • Thanks @Deduplicator, I'm going to read it. – SH.0x90 Jun 30 '14 at 18:42
  • You should ask *one* question at a time, and you should consider whether it is really a programming question that is on-topic at SO. Besides, the questions should be well-formulated, answerable questions. For example, I cannot see what you are asking in your question 1; you are not telling anything about your application, yet you are asking others to tell what to do with some characters in it. – Jukka K. Korpela Jun 30 '14 at 19:13

1 Answer


I'm creating an application that can read a file containing those code points using the \u escape, and I know that I can read and decode it, but now the next question.

If you're writing a program that processes some kind of custom escapes, such as \uXXXX, it's entirely up to you when to convert these escapes into Unicode code points.
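For example, a minimal sketch of such a conversion step might look like this (the function name decode_escapes and the exact escape handling are my own assumptions, not part of any standard API):

#include <cstddef>
#include <string>
#include <vector>

// Scan a line of ASCII text and turn every \uXXXX escape into the
// corresponding Unicode code point; all other bytes pass through
// unchanged as their own code points.
std::vector<char32_t> decode_escapes(const std::string& line)
{
    std::vector<char32_t> code_points;
    for (std::size_t i = 0; i < line.size(); )
    {
        if (line[i] == '\\' && i + 5 < line.size() && line[i + 1] == 'u')
        {
            // Four hex digits follow "\u"; parse them as a number.
            char32_t cp = static_cast<char32_t>(
                std::stoul(line.substr(i + 2, 4), nullptr, 16));
            code_points.push_back(cp);
            i += 6;
        }
        else
        {
            code_points.push_back(static_cast<char32_t>(line[i]));
            ++i;
        }
    }
    return code_points;
}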

To which character set (code page) do I need to convert it?

That depends on what you want to do. If you're using some other library that requires a specific code page then it's up to you to convert data from one encoding into the encoding required by that library. If you don't have any hard requirements imposed by such third party libraries then there may be no reason to do any conversion.
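For instance, if some hypothetical third-party call required UTF-16, you might convert only at that boundary, e.g. with the standard <codecvt> facility (available since C++11, deprecated in C++17 but still usable):

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // UTF-8 encoded data, e.g. as read from a file.
    std::string utf8 = u8"ö";

    // Convert to UTF-16 only because the (hypothetical) library being
    // called requires it; otherwise no conversion would be needed.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string utf16 = conv.from_bytes(utf8);

    // some_library_call(utf16.c_str()); // hypothetical UTF-16 API
}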

I was wondering: in C++, if I try to display non-ASCII characters on the terminal, I get some garbled characters.

This is because various layers of the technology stack use different encodings. From the sample output you give, "├Â", I can see what's happening: your compiler is encoding the string literal as UTF-8, but the console is using Windows code page 850. Normally, when there are encoding problems with the console, you can fix them by setting the console output code page to the correct value; unfortunately, passing UTF-8 through std::cout currently has some unique problems. Using printf instead worked for me in VS2012:

#include <cstdio>
#include <Windows.h>

int main() {
    // Switch the console output code page from the default (850 here)
    // to UTF-8 so the UTF-8 string literal emitted by the compiler
    // displays correctly.
    SetConsoleOutputCP(CP_UTF8);
    std::printf("%s\n", "ö");
}

Hopefully Microsoft fixes the C++ libraries if they haven't already done so in VS 14.

At which part of that process does the encoding come in? Does it encode, take the code point, and try to find the matching glyph in the font?

Bytes of data are meaningless unless you know the encoding. So the encoding matters in all parts of the process.

I don't understand the second question here.

If you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, no matter which font I'm using. What happens?

What's going on here is that when you write charset="iso-8859-1" you also have to actually convert the document to that encoding. You're not doing that; instead, you're leaving the document encoded as UTF-8.

As a little exercise, say I have a file that contains the following two bytes:

0xC3 0xB6

Using information on UTF-8 encoding and decoding, what codepoint do the bytes decode to?

Now using this 8859-1 codepage, what do the same bytes decode to?
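If you want to check your answers in code, here is a rough sketch that works through both interpretations by hand (it only handles this particular two-byte case):

#include <cstdio>

int main()
{
    unsigned char bytes[] = { 0xC3, 0xB6 };

    // As UTF-8: a two-byte sequence 110xxxxx 10xxxxxx carries
    // 5 + 6 payload bits, which combine into one code point.
    unsigned int utf8_cp = ((bytes[0] & 0x1Fu) << 6) | (bytes[1] & 0x3Fu);
    std::printf("As UTF-8:      U+%04X\n", utf8_cp);

    // As ISO 8859-1: every byte is simply its own code point.
    std::printf("As ISO 8859-1: U+%04X U+%04X\n",
                (unsigned)bytes[0], (unsigned)bytes[1]);
}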

As another exercise, save two copies of your HTML document, one using charset="iso-8859-1" and one with charset="utf-8". Now use a hex editor and examine the contents of both files.

Imagine now that I read the file, I decode it, I have all the code points, and I need to save the file again. Do I need to save it with the escapes (\u0000), or do I need to convert them back into characters first and then save?

This depends on the program that will need to read the file. If the program expects all non-ASCII characters to be escaped like that then you have to save the file that way. But escaping characters with \u is not a normal thing to do. I only see this done in a few places, such as JSON data and C++ source code.
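If you do have to produce such a file, a rough sketch (assuming all code points fit in the Basic Multilingual Plane, i.e. below U+10000) might be:

#include <cstdio>
#include <vector>

int main()
{
    // Code points to write back out, e.g. obtained when the file was decoded.
    std::vector<unsigned int> code_points = { 0x0041, 0x00F6 };

    for (unsigned int cp : code_points)
    {
        if (cp < 0x80)
            std::printf("%c", static_cast<char>(cp));  // keep plain ASCII as-is
        else
            std::printf("\\u%04X", cp);                // escape everything else
    }
    std::printf("\n");
}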

Why is the word "unicode" a bit overloaded and sometimes understood to mean UTF-16?

Largely because Microsoft uses the term this way. They do so for historical reasons: when they added Unicode support they named all their options and settings "Unicode", but the only encoding they supported was UTF-16.

bames53