
Let's assume I want to write a function to compare two Unicode characters. How should I do that? I read some articles around (like this) but still didn't get it. Let's take '€' as input. It's in the range 0x0800 to 0xFFFF, so it takes 3 bytes to encode. How do I decode it? Bitwise operations to get the 3 bytes from a wchar_t and store them into 3 chars? An example in C would be great.

Here's my C code to "decode" it, but it obviously shows the wrong values when decoding Unicode...

#include <stdio.h>
#include <wchar.h>

void printbin(unsigned n);
int length(wchar_t c);
void print(struct Bytes *b);

// support for UTF8 which encodes up to 4 bytes only
struct Bytes
{
    char v1;
    char v2;
    char v3;
    char v4;
};

int main(void)
{
    struct Bytes bytes = { 0 };
    wchar_t c = '€';
    int len = length(c);

    //c = 11100010 10000010 10101100
    bytes.v1 = (c >> 24) << 4; // get first byte and remove leading "1110"
    bytes.v2 = (c >> 16) << 5; // skip over first byte and get 000010 from 10000010
    bytes.v3 = (c >> 8)  << 5; // skip over first two bytes and 10101100 from 10000010
    print(&bytes);

    return 0;
}

void print(struct Bytes *b)
{
    int v1 = (int) (b->v1);
    int v2 = (int)(b->v2);
    int v3 = (int)(b->v3);
    int v4 = (int)(b->v4);

    printf("v1 = %d\n", v1);
    printf("v2 = %d\n", v2);
    printf("v3 = %d\n", v3);
    printf("v4 = %d\n", v4);
}

int length(wchar_t c)
{
    if (c >= 0 && c <= 0x007F)
        return 1;
    if (c >= 0x0080 && c <= 0x07FF)
        return 2;
    if (c >= 0x0800 && c <= 0xFFFF)
        return 3;
    if (c >= 0x10000 && c <= 0x1FFFFF)
        return 4;
    if (c >= 0x200000 && c <= 0x3FFFFFF)
        return 5;
    if (c >= 0x4000000 && c <= 0x7FFFFFFF)
        return 6;

    return -1;
}

void printbin(unsigned n)
{
    if (!n)
        return;

    printbin(n >> 1);
    printf("%c", (n & 1) ? '1' : '0');
}
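
For reference, a minimal sketch of the 3-byte case done with the usual masks and shifts. Note the key assumption, which differs from the code above: here c holds the code point itself (0x20AC for '€'), not the already-encoded UTF-8 bytes:

#include <stdio.h>

int main(void)
{
    unsigned int c = 0x20AC; /* code point of '€' (assumed input format) */

    /* 3-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
       The 16 payload bits of c are split 4 + 6 + 6. */
    unsigned char v1 = 0xE0 | ((c >> 12) & 0x0F);
    unsigned char v2 = 0x80 | ((c >> 6)  & 0x3F);
    unsigned char v3 = 0x80 |  (c        & 0x3F);

    printf("%02X %02X %02X\n", v1, v2, v3); /* prints E2 82 AC */
    return 0;
}
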
  • So you're asking about UTF-8? Unicode does not specify a *representation*; it defines a numeric value for each character, but it doesn't specify how those numeric values are represented. UTF-8 encodes each character as a sequence of 1 or more bytes. – Keith Thompson Aug 25 '14 at 00:33
  • Yes, UTF-8. I still don't get this. It does store these numeric values in a byte sequence, but how do I retrieve/decode it? – Jack Aug 25 '14 at 00:35
  • Certainly many related posts already on Stack Overflow. An old [utf8 effort of mine](http://stackoverflow.com/a/19917585/2410359). There are a number of subtleties such that it is easy to mis-code. Your code does not flag illegal sequences. Good luck – chux - Reinstate Monica Aug 25 '14 at 03:02
  • Read [this](http://en.m.wikipedia.org/wiki/Unicode_equivalence) first, then phrase your question again using the words "character" and/or "codepoint" in their precise meaning. – n. m. could be an AI Aug 25 '14 at 05:10
  • Comparison is much more complex than merely decoding. You need to understand [normalization](http://userguide.icu-project.org/transforms/normalization), or use a library which does (the link is to ICU). – tripleee Aug 25 '14 at 18:43
  • The answer I left at http://stackoverflow.com/a/148766/5987 is for C++ but it wouldn't be hard to convert to pure C. – Mark Ransom Aug 27 '14 at 03:07

1 Answer


It's not at all easy to compare UTF-8 encoded characters. Best not to try. Either:

  1. Convert them both to a wide format (a 32-bit integer) and compare them arithmetically; a sketch of this approach follows the list. See wstring_convert or your favorite vendor-specific function; or

  2. Convert them into one-character strings and use a function that compares UTF-8 encoded strings. There is no standard way to do this in C++, but it is the preferred method in other languages such as Ruby, PHP, whatever.
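
For illustration, here is a minimal C sketch of option 1 using a hand-rolled decoder. The helper name utf8_decode is made up, and it handles only well-formed 1-3 byte sequences; real code must also validate continuation bytes, reject overlong forms and handle 4-byte sequences:

#include <stdint.h>
#include <stdio.h>

/* Decode one well-formed UTF-8 sequence (1-3 bytes) into a code point. */
uint32_t utf8_decode(const unsigned char *s)
{
    if (s[0] < 0x80)                   /* 0xxxxxxx: ASCII */
        return s[0];
    if ((s[0] & 0xE0) == 0xC0)         /* 110xxxxx 10xxxxxx */
        return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    if ((s[0] & 0xF0) == 0xE0)         /* 1110xxxx 10xxxxxx 10xxxxxx */
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             |  (uint32_t)(s[2] & 0x3F);
    return 0xFFFD;                     /* U+FFFD replacement char on error */
}

int main(void)
{
    const unsigned char a[] = { 0xE2, 0x82, 0xAC, 0 }; /* U+20AC '€' */
    const unsigned char b[] = { 0xE2, 0x82, 0xAC, 0 };

    /* Compare as plain 32-bit integers once decoded. */
    printf("equal: %d\n", utf8_decode(a) == utf8_decode(b)); /* prints 1 */
    return 0;
}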


Just to make it clear, the thing that is hard is to take raw bits/bytes/characters encoded as UTF-8 and compare them. This is because your comparison has to take account of the encoding to know whether to compare 8 bits, 16 bits or more. If you can somehow turn the raw data bits into a null-terminated string, then the comparison is trivially easy using regular string functions, as sketched below. The string may be more than one byte/octet in length, but it will represent a single character/code point.
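
A minimal sketch of that string-based comparison, assuming each character has already been isolated as its own null-terminated UTF-8 string:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Each "character" is a null-terminated UTF-8 byte sequence. */
    const char *a = "\xE2\x82\xAC"; /* U+20AC EURO SIGN, 3 bytes */
    const char *b = "\xE2\x82\xAC";

    /* Identical code points yield identical byte sequences, so plain
       byte-wise comparison works for exact equality. */
    printf("equal: %d\n", strcmp(a, b) == 0); /* prints 1 */
    return 0;
}

Note that this tests exact byte-wise equality only; as the comments on the question point out, canonically equivalent but differently normalized characters will not compare equal this way.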


Windows is a bit of a special case. Wide characters are short int (16-bit). Historically this meant UCS-2, but it has been redefined as UTF-16. This means that all valid characters in the Basic Multilingual Plane (BMP) can be compared directly, since each occupies a single short int; characters outside the BMP are encoded as surrogate pairs (two short int values used together) and cannot be compared that way. I am not aware of any simple way to deal with 32-bit wide characters (represented as a simple int) outside the BMP on Windows.
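
For completeness, a sketch of combining a UTF-16 surrogate pair into a 32-bit code point so it can be compared arithmetically outside the BMP. The helper name is made up; the arithmetic is the standard UTF-16 formula:

#include <stdint.h>
#include <stdio.h>

/* Combine a UTF-16 surrogate pair into a code point.
   hi must be in 0xD800-0xDBFF, lo in 0xDC00-0xDFFF (not checked here). */
uint32_t utf16_pair_to_codepoint(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

int main(void)
{
    /* U+1F600 is encoded in UTF-16 as the pair 0xD83D 0xDE00. */
    printf("U+%04X\n", (unsigned)utf16_pair_to_codepoint(0xD83D, 0xDE00));
    return 0;
}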

  • `wchar_t` is not 32 bits on Windows. – michaelmeyer Aug 25 '14 at 03:32
  • @doukremt wchar_t on Windows is not Unicode-compliant, but no one prevents you from rolling your own (or using char32_t). – n. m. could be an AI Aug 25 '14 at 05:15
  • @doukremt: I agree, but this will only work outside the BMP if you can find a 32 bit function. If you only need BMP then 16 bit is enough. – david.pfx Aug 25 '14 at 06:37
  • Can you expand on your statement that "it is not easy"? Regular `string` functions work just fine, comparing one UTF8 string to another. (I assume you are not mixing this up with *validating* a UTF8 string, or *normalizing* Unicode codepoints.) – Jongware Aug 26 '14 at 21:51
  • @Jongware: You misunderstand. My answer already said that. See edit. – david.pfx Aug 27 '14 at 01:22
  • `wchar_t` on Windows is used for holding UTF-16 encoded characters, where two `wchar_t` values are used together for surrogates. – Remy Lebeau Aug 27 '14 at 01:36