
In this question: Convert ISO-8859-1 strings to UTF-8 in C/C++

There is a really nice, concise piece of C++ code that converts ISO-8859-1 strings to UTF-8.

In this answer: https://stackoverflow.com/a/4059934/3426514

I'm still a beginner at C++ and I'm struggling to understand how this works. I have read up on the UTF-8 encoding sequences, and I understand that below 128 the characters stay the same, while above 128 the first byte gets a prefix and the remaining bits are spread over a couple of bytes, each starting with 10xx, but I see no bit shifting in this answer.

If someone could help me to decompose it into a function that only processes 1 character, it would really help me understand.

user230910
    What is inside the `while(*in)` is just the code you're asking for – the `if...else...` converts a single `*in` input character (and advances the `in` pointer, so that the `while()` loop may iterate the conversion on the next one). – CiaPan Sep 29 '15 at 08:07
  • I wouldn't call the given code "*really nice"*. Lack of braces and white space, and overlong lines make it quite messy. – user694733 Sep 29 '15 at 08:16
  • you have a point, i meant nice = efficient, not nice = maintainable – user230910 Sep 30 '15 at 13:13

1 Answer


Code, commented.

This relies on the fact that the Latin-1 bytes 0x00 through 0xFF map to consecutive UTF-8 code sequences: 0x00–0x7F stay single bytes, 0x80–0xBF become 0xC2 0x80–0xBF, and 0xC0–0xFF become 0xC3 0x80–0xBF.

// converting one byte (one Latin-1 character) of input per iteration;
// 'in' and 'out' must be pointers to unsigned char, or the comparisons
// below misfire on negative (signed) char values
while (*in)
{
    if ( *in < 0x80 )
    {
        // just copy
        *out++ = *in++;
    }
    else
    {
        // first byte is 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
        // (the comparison in () evaluates to 0 or 1)
        *out++ = 0xc2 + ( *in > 0xbf );

        // second byte is the lower six bits of the input byte
        // with the highest bit set (and, implicitly, the second-
        // highest bit unset)
        *out++ = ( *in++ & 0x3f ) + 0x80;
    }
}

The problem with a function processing a single (input) character is that the output could be either one or two bytes, making the function a bit awkward to use. You are usually better off (both in performance and cleanliness of code) with processing whole strings.

Note that the assumption of Latin-1 as input encoding is very likely to be wrong. For example, Latin-1 doesn't have the Euro sign (€), or any of the characters ŠšŽžŒœŸ, which makes most people in Europe actually use either Latin-9 or CP-1252, even if they are not aware of it. ("Encoding? No idea. Latin-1? Yeah, that sounds about right.")

All that being said, that's the C way to do it. The C++ way would (probably, hopefully) look more like this:

#include <unicode/unistr.h>
#include <unicode/bytestream.h>

// ...

icu::UnicodeString ustr( in, "ISO-8859-1" );

// ...work with a properly Unicode-aware string class...

// ...convert to UTF-8 if necessary.
char buffer[ BUFSIZE ];
icu::CheckedArrayByteSink bs( buffer, BUFSIZE );
ustr.toUTF8( bs );

That is using the International Components for Unicode (ICU) library. Note how easily this is adapted to a different input encoding. Different output encodings, iostream operators, character iterators, and even a C API are readily available from the library.
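If pulling in ICU is overkill for your use case, the C-style loop above can also be wrapped into a dependency-free whole-string conversion using only the standard library (again a sketch of my own; the name `latin1_to_utf8` is not from the answer):

```cpp
#include <string>

// Convert a whole ISO-8859-1 string to UTF-8 using only std::string.
std::string latin1_to_utf8( const std::string & latin1 )
{
    std::string utf8;
    utf8.reserve( latin1.size() * 2 );  // worst case: every byte doubles
    for ( unsigned char c : latin1 )    // unsigned, for the same signedness reason as above
    {
        if ( c < 0x80 )
        {
            utf8 += static_cast<char>( c );
        }
        else
        {
            utf8 += static_cast<char>( 0xc2 + ( c > 0xbf ) );
            utf8 += static_cast<char>( ( c & 0x3f ) + 0x80 );
        }
    }
    return utf8;
}
```

Unlike the pointer-based loop, this also handles embedded NUL bytes, since it iterates over the string's length instead of stopping at the first zero.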

DevSolar
  • Thank you so much for this, i seriously appreciate it! – user230910 Sep 30 '15 at 11:42
  • Ok, so that means the & is filtering out the top 2 bits of the input, therefore no bit shifting required – user230910 Sep 30 '15 at 13:17
  • @user230910: `0x3f` is binary `00111111`, so yes -- it's filtering the top 2 bits. They are not significant for the second byte of UTF-8 (in this specific case of Latin-1 input). – DevSolar Sep 30 '15 at 13:27
  • That ^ was the missing key to understanding what was happening, thank you again for all the effort you put in to help me understand! – user230910 Sep 30 '15 at 15:12