Old question: "How SubString,Limit Using C?", but no one answered it.

I want to get the character at a given index of a string.

My string may contain symbols and UTF-8 characters (e.g. ß).

String-handling speed is important to me.

1#: Is the wchar_t data type a good choice for me?

2#: How can I get a single character from a UTF-8 string?

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <string.h>

int main()
{
    wchar_t *msg1 = L"ßC Programming";
    //wprintf(L" vals> %Ls\n",msg1);
    //wprintf(L" vals> %s\n",msg1);
    printf(" vals> %Ls %S\n",msg1,msg1); // shows nothing =====> BUG
    printf(" val> %Lc\n",msg1[1]);       // shows `C`
    printf(" val> %Lc\n",msg1[0]);       // shows nothing =====> BUG
    printf("\n");
    /////////////////////////////////
    char *msg2 = "ßC Programming";
    printf(" vals> %s\n",msg2);          // shows `ßC Programming`
    printf(" val> %c\n",msg2[1]);        // shows `�` =====> BUG
    printf(" val> %c\n",msg2[0]);        // shows `�` =====> BUG
    printf("\n");
}

Please guide me in solving these problems.

GoWorkCode

1 Answer

wchar_t can be an option, but you should be aware of the encoding it uses. If it is 16 bits wide and holds UTF-16 (common, but not guaranteed), then any code point equal to or higher than 0x10000 (U+10000) occupies two wchar_t units, and you have the same problem again...

I personally would rather stay with normal char, though.

The question now is how to detect multibyte characters. You can spot them by looking at the most significant bit (MSB): if it is not set, you have a normal (ASCII-compatible) character; if it is set, the byte is part of a multibyte character.

If the second MSB is set too, the byte is the start byte of a multibyte sequence; if it is not set, it is a follow-up byte.

The format of a UTF-8 multibyte sequence is as follows:

First byte: the n most significant bits are set to 1 and specify how many bytes the entire sequence comprises; they are followed by a zero bit. The remaining bits are the most significant bits of your Unicode code point.

Each subsequent byte has 10 as its most significant bits; the remaining 6 bits are the next most significant bits of your code point.

Example: the letter 'ß' has Unicode code point 0xdf, binary 0b11011111.

It requires 8 bits, which do not fit into the seven available in a single-byte character, so we need to split it:

11 + 011111

We need two bytes in total, so we add the byte headers 110 and 10; the first byte must then be padded with zeros:

110 000 11 + 10 011111

So you get the byte sequence 0b11000011, 0b10011111 (hexadecimal: 0xc3, 0x9f).

There are, though, libraries facilitating this. You might be interested in ICU, for instance.

Aconcagua
  • Is ICU a library for C? – GoWorkCode Apr 20 '17 at 08:50
  • @GoWorkCode Citing their site: "ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications." – Aconcagua Apr 20 '17 at 08:52
  • Does it support Raspberry Pi, microcontrollers, and Windows? – GoWorkCode Apr 20 '17 at 08:53
  • Can't I get a UTF-8 character without ICU? Many projects do this without a library! – GoWorkCode Apr 20 '17 at 08:56
  • About C support: http://icu-project.org/apiref/icu4c/. For microcontrollers, you need to check yourself (you might just try to compile the library from source); Windows is supported ("For Microsoft Visual Studio, the /utf-8 option is set in ICU's .vcxproj files."). ICU is just a helper; if you don't want to or cannot use it, you can still decompose UTF-8 manually as I described in my answer... – Aconcagua Apr 20 '17 at 09:02
  • You might want to have a look at [here](http://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16), too... – Aconcagua Apr 20 '17 at 09:04
  • "*any code point you need not fitting into 15 bits (MSB is reserved, too)*" - that is not how UTF-16 works. There are no reserved bits. Codepoints up to 16bits can fit in a 16bit `wchar_t`/`char16_t` just fine (this is backwards compatible with UCS-2, UTF-16's predecessor). Codepoints higher than 16bits are re-encoded to 20bits that are spread across two adjoining 16bit `wchar_t`/`char16_t` values with prefixes added to them. – Remy Lebeau Apr 20 '17 at 18:25
  • @RemyLebeau Thanks, fixed the technical inaccuracy. My point was with utf-16, too, we can have multibyte characters, so it does not *necessarily* help us out... – Aconcagua Apr 20 '17 at 18:36