I read a lot about Unicode, ASCII, code pages, all the history, the invention of UTF-8, UTF-16 (UCS-2), UTF-32 (UCS-4) and who use them and so on, but I still having some questions that I tried hardly to find answers but I couldn't and I hope you to help me.
1 - Unicode is a standard for encoding characters and they specify a code point for each character. Something like U+0000 (example). Imagine that I have a file that has those code points (\u0000), in which point of my application I'm going to use it?
This might be a silly question but I really don't know in which point of my application I'm going to use it.
I'm creating an application that can read file that has those code points using the escape \u
and I know that I can read it, decode it but now the next question.
2 - To which character set (code page) do I need to convert it? I saw some C++ libraries that they uses the name utf8_to_unicode
or utf8-to-utf16
and also only utf8_decode
, and this is what makes me confuse.
I don't know if will appear answers like this, but some might say: You need to convert it into code pages that you are going to use, but what if my application needs to be internationalized?
3 - I was wondering, in C++ if I try to display non-ASCII characters on terminal I got some confusing words. The question is: What makes the characters to be displayed are the fonts?
#include <iostream>
int main()
{
std::cout << "ö" << std::endl;
return 0;
}
The output (Windows):
├Â
4 - In which part of that process the encoding enter? It encodes, takes the code point and try to find the word that is equal on the fonts?
5 = WebKit is an engine for rendering web pages in web browsers, if you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, doesn't matter the font that I'm using, what happen?
<html>
<head>
<meta charset="iso-8859-1">
</head>
<body>
<p>ö</p>
</body>
</html>
The output:
ö
Works using:
<meta charset="utf-8">
6 - Imagine now that I read the file, I encode it, I have all the code points and I need to save the file again. Do I need to save it encoded (\u0000) or I need to decode first to transform again into characters and then save?
7 - Why the word "unicode" is a bit overloaded and is sometimes understood to mean utf-16? (source)
That's all for now. Thanks in advance.