How to read a unicode string in form of individual ascii chars and detect that it was actually unicode in the fastest way possible?

Question

I am making a library which lets user insert and search key-value pairs as trie data structure. When I insert a unicode string, it breaks down into 4 characters(utf-8)(which is okay), but each character becomes ‘?’. So I tried using setlocale(LC_ALL, "") which didn’t work (or maybe I just dunno what are the right arguments for my case and where to call it). I don’t really care about printing or reading the character as it is. All I want is that it can somehow be represented uniquely.

In my trie there are links like node *next[256].

So all I want is when a unicode string gets inserted, it gets inserted as a unique combination which would make it possible to search that string uniquely. Also I want a way to detect that a unicode character was broken down into 4 individual chars. Thats because, e.g., If in a string “wxyz" a unicode character “x” is broken down into a,b,c,d then trie would store “wabcdyz”. But if I was actually searching a string wabcdyz(not unicode), then it would find the entry for that string but that would be a mismatch.

Here is a program that shows the unicode character being broken down into four ? characters:

#include <stdio.h>

int main()
{
    printf("Hello World");

    char a[] = "Ƃ";

    int i;
    for(i = 0 ; a[i] != '\0' ; ++i)
    {
        printf("%c", a[i]);
    }

    return 0;
}

In my expirience utf-8 character was reprezent in 2 bytes, 2 chars. Utf first byte begin from 0xC0 - https://en.wikipedia.org/wiki/UTF-8, so you must store chars in bytes (uint8 / signed char) — Igor Galczak, May 22 '19 at 14:58
@IgorGalczak UTF-8 characters can be from one to four bytes (if you respect the artificial limitation of Unicode to 0x10FFFF code points) or one to six (if you stick with the original definition of UTF-8). I suspect OP is actually getting UTF-16 or -32. — zwol, May 22 '19 at 15:03
@Mihir By design, every byte of a UTF-8 multibyte sequence has its high bit set, so they can never collide with ASCII characters. As long as your trie structure does not clobber the high bit, it should be able to work with UTF-8 strings just like it does ASCII, without ever having to know the encoding explicitly. However, if your Unicode strings _consistently_ have four bytes per character they are probably not UTF-8, but UTF-32, which is much harder to work with. Please tell us what the numeric value of each of the four bytes in one of the problem characters is. — zwol, May 22 '19 at 15:05
On the flip side, however, every byte with the high bit clear is a valid single-byte UTF-8 code sequence, and any number of such bytes strung together is valid utf-8 encoded text. Also, there are many byte sequences containing some with the high bit set that are not valid UTF-8 code sequences. — John Bollinger, May 22 '19 at 15:11
@zwol, I see, all I need is that one of those unicode characters to be able to be representable as one of the characters from 0-256, because that’s how my trie is establishing links. It takes the character and converts it to numerical value and proceeds to next link accordingly. I would just need the most accurate and fastest way to achieve this. — Mihir Luthra, May 22 '19 at 15:12
"Unicode string"? In C you have an encoding. So you are meaning "UTF-8"? Or you are doing like python, an unicode interface and you handle (and hide) the details? — Giacomo Catenazzi, May 22 '19 at 15:13
@Mihir, I note that you say "Unicode", but tag [utf-8], and describe behavior that doesn't seem consistent with UTF-8. UTF-8 is *one way* of representing Unicode text, but not the only way. — John Bollinger, May 22 '19 at 15:14
https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80, as per this I see a way to detect if sequence of bytes is a unicode character, but this only seems to be the case in utf-8. If my implementation requires utf-32, what shall I do then? — Mihir Luthra, May 22 '19 at 15:15
@Mihir It is impossible to convert an arbitrary Unicode character to a number in `[0, 255]` without losing information, because of the [pigeonhole principle](https://en.wikipedia.org/wiki/Pigeonhole_principle). But nothing will go wrong if you give each byte of a UTF-8 multibyte sequence its own node in the trie. — zwol, May 22 '19 at 15:18
@zwol, each byte in multibyte sequence of utf-8 would get node in trie. One problem that I still face is that even after `setlocale()`, it still reads unicode characters like ‘?’ ‘? ‘?’ ‘?’. How do I make it read the right character? — Mihir Luthra, May 22 '19 at 15:28
@Mihir Two questions: (1) Why does your test program use `int a[] = "Ƃ"`? That is definitely wrong. It won't even compile on my computer. (2) What does your program print, _on your computer_, if you change `%c` to `%02x` in the `printf` call? — zwol, May 22 '19 at 16:00
If I change `a`'s element type to `char` in the code presented, it compiles fine on my machine and prints the expected character when it runs. I note, however, that the character `Ƃ` is not a member of C's basic source or execution character set. It is up to the implementation whether to accept that character appearing as itself in C source code, and to what it corresponds at execution time. — John Bollinger, May 22 '19 at 16:06
@Mihir Still waiting for the numeric values of those bytes... — zwol, May 22 '19 at 16:10
*unicode string in form of individual ascii chars*. That's an oxymoron right here. Anyway [your code works just fine](https://ideone.com/UT3jac). My guess is you have a broken OS with no proper UTF-8 support, but you are not sharing any details of your environment, so no further progress can be made. — n. m. could be an AI, May 22 '19 at 16:16
@zwol, maybe I understand it a bit now, ffffffc6 breaks to index “-58” and there is no index such as that. Does that mean the each node in utf-8 sequence isn’t within range 0-255? — Mihir Luthra, May 22 '19 at 16:19
@JohnBollinger a general purpose compiler that does not allow Unicode in source code may be conformant but it doesn't make it any less broken and useless. Just throw it away and get one that does the right thing. It's not that there's any kind of shortage out there. — n. m. could be an AI, May 22 '19 at 16:22
@Mihir That looks like unwanted sign extension to me, not actual 0xFF bytes in your string literal. What does it print if you change that entire `printf` line to read `printf("%02x", (unsigned int)(unsigned char)a[i]);` ? — zwol, May 22 '19 at 16:23
@zwol, it is now c6. Well my program works finally. That was the reason my data structure was failing. the time I changed it to `unsigned char *str` it works correctly. Thank you so much for your help. — Mihir Luthra, May 22 '19 at 16:33
@Mihir Warning: hex C6 is only the _first_ character of the multibyte sequence that make op Ƃ. You should also get a second byte with value hex 82. — Mr Lister, May 23 '19 at 11:43

John Bollinger · Accepted Answer · 2019-05-22T15:49:22.393

1

UTF-8 is a mechanism for encoding Unicode character sequences as byte sequences, but not the only way. Unicode does not imply UTF-8, and, technically, UTF-8 does not imply Unicode, either.

When I insert a unicode string, it breaks down into 4 characters(utf-8)

That is a function of how you store the string data, and

it sounds broken
it is probably not using UTF-8, contrary to your assertion

So all I want is when a unicode string gets inserted, it gets inserted as a unique combination which would make it possible to search that string uniquely.

That's relatively easy: encode all your strings the same way. Encoding all of them in UTF-8 would be my choice, but you may also use any other stateless encoding that supports all the characters that may appear in your strings, such as UTF-16 or UTF-32. But you must use a consistent encoding for all characters of all strings.

Having done that properly, you don't necessarily need to do anything else special to make your trie work.^* If you choose UTF-16 or UTF-32, however, then I would suggest structuring the trie around the size of their code units (16- or 32 bits, respectively). That's not necessary, but it will likely yield advantages in the form of shallower and therefore better-performing tries.

^* Note, however, that UTF-16 and UTF-32 code units include many encompassing bytes with value 0, such as 0x0031 and 0x00000200. If you do treat these as byte sequences instead of code-unit sequences, then you must account for that. In particular, you must avoid assuming that individual null bytes serve as terminators.

edited May 22 '19 at 15:49

answered May 22 '19 at 15:28

John Bollinger

160,171
8
81
157

That answers most of my question. But how should I even enable utf-8? e.g., I pass Ƃ as a string to get inserted. When I traverse through string character by char, it reads it as ‘?’ ‘? ‘?’ ‘?’. How should I make it read it correctly? – Mihir Luthra May 22 '19 at 15:33
2

@Mihir We really can't help you any further until you tell us the _numeric value_ of each of those four bytes. I can tell you right now that `Ƃ`, encoded in UTF-8, should produce the _two-byte_ sequence 0xC6 0x82, so that makes it practically certain that you are _not_ getting UTF-8, but I still don't know what you _are_ getting. It would also help if you showed us the exact code construct that you mean by "I pass Ƃ as a string to get inserted." – zwol May 22 '19 at 15:40
1

@JohnBollinger You should probably also mention that UTF-16 and -32 are incompatible with using C's one-byte `\0` string terminators; you can use U+0000 instead, but then you have to work with code units instead of bytes. – zwol May 22 '19 at 15:41
@Mihir Oh, also, we need to know whether or not you are using Windows. The answer to your `setlocale` issues changes depending on that. – zwol May 22 '19 at 15:43
@zwol, I use macOS. I have a string like this, `char str[] = “ Ƃ”;`. if i loop through each character in this string until ‘\0’ is encountered, it will loop 4 times and print ‘?’ 4 times. So simply, it doesn’t break the unicode character down correctly, instead it produces ‘?’. Sorry, I am pretty new to this stuff. I will make a short program to depict my problem and post its link here. – Mihir Luthra May 22 '19 at 15:49
@Mihir, supposing that your compiler conforms to C11 (which might require a command-line option), a good bet *for literals*, such as you describe, is to use utf8 literals: `char str[] = u8"Ƃ";`. For data read from external sources, though, you need to either assume that it will come in UTF-8 encoded, or you must be prepared to transcode it to UTF-8 (which requires knowing and supporting its original encoding). – John Bollinger May 22 '19 at 15:56
I tried this on some online compilers and they produce different outputs. On my machine it would print ‘?' 4 times – Mihir Luthra May 22 '19 at 15:56
1

@Mihir I copied your program into your question. In the future, please provide example code by editing the question, not by using pastebin; we want questions and answers to make sense years from now, and that doesn't work if they depend on links to external sites that may no longer be valid. (Did you know you can edit your question? The tiny gray word `edit` under the links is a button. Yes, really. Yes, it's bad UI design. Sorry.) – zwol May 22 '19 at 15:58
@zwol, I am sorry for that. Didn’t come to my mind. Also its just 1 week I joined stack overflow. I will make sure to do it correctly in future – Mihir Luthra May 22 '19 at 16:00
In any case, @Mihir, that what prints on the screen does not match what you see in your source file is reflective of a mismatch between assumed and actual encoding. If this happens when you have not messed with the locale, then that suggests that your source file's actual encoding differs from what the compiler expects. There could be an additional mismatch on the output side, too. All the way around, cooperating components need to know and agree about the encodings used. – John Bollinger May 22 '19 at 16:00
@Mihir No worries. We were all new to this site once. – zwol May 22 '19 at 16:01

How to read a unicode string in form of individual ascii chars and detect that it was actually unicode in the fastest way possible?

1 Answers1