
So I have a standard C string:

char* name = "Jakub";

And I want to convert it to UTF-16. I figured out that UTF-16 will be twice as long: one character takes two chars.
So I create another string:

char name_utf_16[10];  //"Jakub" is 5 characters

Now, I believe that with ASCII characters I will only use the lower bytes, so for all of them it will be like 74 00 for J and so on. With that belief, I can write code like this:

void charToUtf16(char* input, char* output, int length) {
    /*Todo: how to check if output is long enough?*/
    for(int i=0; i<length; i+=2)  //Step over 2 bytes
    {
        //Lets use little-endian - smallest bytes first
        output[i] = input[i];
        output[i+1] = 0;  //We will never have any data for this field
    }
}

But with this process, I ended up with "Jkb". I know of no way to test this properly - I've just sent the string to a Minecraft Bukkit server. And this is what it said upon disconnecting:

13:34:19 [INFO] Disconnecting jkb?? [/127.0.0.1:53215]: Outdated server!

Note: I'm aware that Minecraft uses big-endian. The code above is just an example; in fact, I have my conversion implemented in a class.

Tomáš Zato
  • You should use an existing UTF-16 encoder; creating a robust one yourself is not an easy task. – Esailija Mar 16 '13 at 12:57
  • It wouldn't be, indeed, if I wanted to be able to use the whole character table. But I just want to fit 256 ASCII characters in! Is that a complicated task too? – Tomáš Zato Mar 16 '13 at 13:00
  • Well, nice of you to tell me. But could you please point out where I started to go wrong with my assumptions? – Tomáš Zato Mar 16 '13 at 13:22
  • @TomášZato: «But I just want to fit 256 ASCII characters in!» The ASCII characters are only 128; the upper half of the so-called code page is locale-specific, so you will get weird (and locale-dependent) results for any character outside the first 128. Also, as stated multiple times, the standard library already provides `mbstowcs` (which will work for any character in the current locale) for this task. – Matteo Italia Mar 16 '13 at 14:00

3 Answers


Before I answer your question, consider this:

This area of programming is full of mantraps. It makes a lot of sense to understand the differences between ASCII, UTF-7/8 and ANSI/'Multi-Byte Character Strings (MBCS)', all of which will look and feel identical to an English-speaking programmer, but which need very different handling once they are put in front of a European or Asian user.

ASCII: characters are in the range 0-127 and are only ever one byte each. The clue is in the name: they are great for Americans, but not fit for purpose in the rest of the world.

ANSI/MBCS: This is the reason for 'code pages'. Characters 32-127 are the same as in ASCII, but the range 128-255 can hold additional characters, and some of that range can be used as a flag to mark that a character continues into a second, third or even fourth byte. To process the string correctly you need both the string bytes and the correct code page. If you try to process the string using the wrong code page you will not get the right characters, and you may misinterpret whether a character is one, two or even four bytes long.

UTF-7/8: These are 8-bit-wide encodings of 21-bit Unicode code points. In UTF-8, a Unicode character can be between one and four bytes long. The advantage that the UTF encodings have over ANSI/MBCS is that there is no ambiguity caused by code pages: each character in every script has a unique Unicode code point, which means it is not possible to mangle the character set by interpreting the data on a different computer with different regional settings.

So, to start to answer your question:

  1. You are assuming that your char* will only ever point to an ASCII string. That is a really dangerous assumption to make: the users, not the programmer, are in control of the data that gets typed in. Windows programs will store this as MBCS by default.

  2. Your second assumption is that a UTF-16 encoding will be twice the size of an 8-bit encoding. That is not generally a safe assumption: depending on the source encoding, the UTF-16 encoding may be twice the size, may be less than twice the size, and in extreme cases may actually be shorter. For example, a three-byte UTF-8 character such as '€' becomes two bytes in UTF-16, while a character outside the Basic Multilingual Plane takes four bytes in both encodings.

So, what is the safe solution?

The safe option is to implement your application internally as Unicode. On Windows this is a compiler option, and it means your Windows controls all use wchar_t* strings as their data type. On Linux I'm less sure that you can always use Unicode graphics and OS libraries. You must also use the wcslen() family of functions to get the length of strings, etc. When you interact with the outside world, be precise about the character encodings used.
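
As a purely illustrative sketch of that wide-string style (the string here is just a placeholder), wchar_t literals and wcslen() look like this in C:

#include <wchar.h>
#include <stdio.h>

int main(void)
{
    const wchar_t* name = L"Jakub";   /* wide string literal: elements are wchar_t, not char */
    wprintf(L"%ls is %zu wide characters long\n", name, wcslen(name));
    return 0;
}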

Answering your question then becomes a matter of changing the question to: what do I do when I receive non-UTF-16 data?

Firstly, be very clear about what assumptions you are making about its formatting; and secondly, accept the fact that the conversion to UTF-16 may sometimes fail.

If you are clear on the source formatting, you can then choose the appropriate Win32 or STL converter to convert the format, and you should then look for evidence that the conversion failed before using the result - e.g. mbstowcs() from the standard library, or MultiByteToWideChar() on Windows. However, using either of these approaches safely means you need to understand ALL of the above.
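
To illustrate that last point, here is a minimal Windows-only sketch (the buffer size and the assumption that the input is UTF-8 are illustrative, not taken from the question) which refuses to use the result if MultiByteToWideChar() reports failure:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    const char* input = "Jakub";   /* assumed here to be UTF-8 */
    wchar_t output[32];            /* illustrative fixed-size buffer */

    /* MB_ERR_INVALID_CHARS makes the call fail on malformed input
       instead of silently substituting characters. */
    int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      input, -1,   /* -1: input is NUL-terminated */
                                      output, 32);
    if (written == 0)
    {
        fprintf(stderr, "Conversion failed, error %lu\n", GetLastError());
        return 1;
    }

    wprintf(L"%ls\n", output);
    return 0;
}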

All other options introduce risk. Use MBCS strings and you will have strings mangled by being entered with one code page and processed with a different one. Assume ASCII data, and when you encounter a non-ASCII character your code will break, and you will 'blame' the user for your shortcomings.

Michael Shaw
  • There are libraries specifically designed to deal with Unicode-related problems. Unicode also evolves, and system encoding/decoding functions may fail if they meet an outdated or too-new code point. The problem with `mbstowcs` is its false portability, as the Windows platform supports its own flavor of it (being blatantly noncompliant with C11). – Swift - Friday Pie Dec 28 '21 at 15:30

Why do you want to make your own Unicode conversion functionality when there are existing C/C++ functions for this, like mbstowcs(), which is included in <cstdlib>?
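
As a minimal sketch of what that call looks like in C (where mbstowcs() lives in <stdlib.h>; the buffer size is illustrative, and note that the width and encoding of wchar_t are platform- and locale-dependent, as the comments below point out):

#include <stdlib.h>   /* mbstowcs */
#include <locale.h>   /* setlocale */
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");                 /* use the environment's locale/encoding */

    const char* input = "Jakub";           /* multibyte string in the locale's encoding */
    wchar_t output[16];                    /* illustrative fixed-size buffer */

    size_t n = mbstowcs(output, input, 16);
    if (n == (size_t)-1)                   /* invalid multibyte sequence encountered */
    {
        fprintf(stderr, "Conversion failed\n");
        return 1;
    }

    wprintf(L"Converted %zu characters: %ls\n", n, output);
    return 0;
}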

If you still want to make your own stuff, then have a look at the Unicode Consortium's open-source code, which can be found here:

Convert UTF-16 to UTF-8 under Windows and Linux, in C

Community
  • `mbstowcs` is not necessarily UTF-16 since it's locale-specific. C++11 has `codecvt`, which might be a better example. – prideout Jul 06 '13 at 16:53
  • @prideout So, it's potentially UTF-32 in S. Korea, China, and Japan? –  Jul 06 '13 at 18:29
  • The wide characters in `mbstowcs` are potentially 4 bytes. For example, `zh_CN.UTF-8` and `zh_CN.GB2312` are both valid locales for China, but they use different character encodings. – prideout Jul 07 '13 at 00:27
output[i] = input[i];

This assigns only every other byte of the input, because you increment i by 2 - so it's no wonder you obtain "Jkb". You probably wanted to write:

output[i] = input[i / 2];
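
Folding that fix back into the original function gives a minimal sketch that assumes the input is plain 7-bit ASCII and the desired output is little-endian UTF-16 (here length is taken to be the size of the output buffer, as in the question):

void charToUtf16(const char* input, char* output, int length)
{
    for (int i = 0; i + 1 < length; i += 2)
    {
        output[i]     = input[i / 2];  /* low byte: the ASCII code unit */
        output[i + 1] = 0;             /* high byte: always 0 for 7-bit ASCII */
    }
}
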
user2155932
  • oh no, how can I be so lame. Thank you :) – Tomáš Zato Mar 16 '13 at 13:28
  • That's not so easy... the conversion depends on the input encoding; padding with zeroes will only work for characters in the range 0-127, and only if the original encoding is ASCII-based. Also, the standard library already provides `mbstowcs`, so it's useless to implement a custom (broken) solution like that. – Matteo Italia Mar 16 '13 at 13:49
  • Which the OP has already mentioned. The solution only needs to be good for the OP's case; it does not have to be good for all possible cases. – Dialecticus Mar 16 '13 at 13:54
  • @Dialecticus: there's no point in helping to build a broken solution when a working one is already available in the standard library. – Matteo Italia Mar 16 '13 at 13:58