How to convert a Cyrillic char array to an array of the Unicode code of each character?

Question

I have something like

char str[] = "тест";

and I need to convert every symbol to the code of that symbol (А - 53392,
Б - 53393, В - 53394, etc.). Right now I use

char symb = 'у';
int number = symb - ' ';

or

int symbol = 'У';

but it works only for one symbol, and I get this warning:

warning: multi-character character constant [-Wmultichar]

I also tried

long int str[] = { 'А' , 'Б', 'В'};
printf("char_offset:%d\n", str[1]);

and it works, but it is not easy to declare strings with many symbols this way. I also get these warnings:

Xlib1.c:295:17: warning: multi-character character constant [-Wmultichar]
   int str[] = { 'А' , 'Б', 'В'};
                 ^
Xlib1.c:295:24: warning: multi-character character constant [-Wmultichar]
   int str[] = { 'А' , 'Б', 'В'};
                        ^
Xlib1.c:295:30: warning: multi-character character constant [-Wmultichar]
   int str[] = { 'А' , 'Б', 'В'};
                              ^

But it works. I use these flags with gcc:

 -finput-charset=UTF-8 -std=c11 -fextended-identifiers

I need to use this code on an STM32. Please help me convert a string of Cyrillic characters to an array of int codes, one per character in the string.

`char` has only 8 bits, therefore it isn't enough to store those large Unicode codepoints. Where are you using those strings? Do the receiving functions support Unicode? — phuclv, Mar 24 '19 at 10:09
What type of array can I use instead of char? I tried long int str[] = { 'А' , 'Б', 'В'}; printf("char_offset:%d\n", str[1]); and it works. — Vasiliy Platon, Mar 24 '19 at 10:50
It doesn't work. The compiler already gave you a lot of useful warnings like "multi-character character constant [-Wmultichar]". It might *look* like it's working because C has [multi-character literals](https://stackoverflow.com/q/3960954/995714) like `'ABCD'`, but that's **not** the kind of char one expects in a string. It's entirely unclear what you want to do with the characters, but you must store them as a string instead, or use `wchar_t` (which is not a good idea) — phuclv, Mar 24 '19 at 11:05
For example, `А` and `Б` are Unicode [U+0410](https://www.fileformat.info/info/unicode/char/0410/index.htm) and [U+0411](https://www.fileformat.info/info/unicode/char/0411/index.htm), which are 1040 and 1041 in decimal, not 53392 as you saw in the input, because multi-character literals are often — phuclv, Mar 24 '19 at 11:14
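To make the 53392 versus 1040 difference concrete, here is a small sketch. It assumes the source file is UTF-8, where `А` is stored as the two bytes 0xD0 0x90, and that the compiler packs a multi-character constant byte by byte, which is what GCC does (the value is implementation-defined). The function names `pack_two_bytes` and `decode_two_bytes` are made up for this illustration.

```c
#include <stdint.h>

/* Value a two-byte multi-character constant gets when the compiler
   packs its bytes one after another (GCC behaviour; other compilers
   may differ because this is implementation-defined). */
static uint32_t pack_two_bytes(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)b0 << 8) | b1;
}

/* Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx) into the
   Unicode code point it represents. No validation is done here. */
static uint32_t decode_two_bytes(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}
```

With the bytes of `А`, `pack_two_bytes(0xD0, 0x90)` gives 0xD090, which is 53392, while `decode_two_bytes(0xD0, 0x90)` gives 0x0410, which is 1040.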
It isn't clear (to me) from your question what you're trying to do, nor what you're using as input or output character set encodings. You could look at using the C90 wide-character encodings with an `L` prefix: `L'Б'` or `L"Б"`. Or you could look at using the C11 Unicode encodings: `u` , `u8` and `U` as prefixes in place of `L`. (See C11 [§6.4.4.4 Character constants](https://port70.net/~nsz/c/c11/n1570.html#6.4.4.4) and [§6.4.5 String literals](https://port70.net/~nsz/c/c11/n1570.html#6.4.5) for more information.) Do you want UTF-32, or UTF-16, or UTF-8 as output? What's the input code set? — Jonathan Leffler, Mar 24 '19 at 15:38
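A minimal sketch of the C11 `U` prefix mentioned above, assuming a UTF-8 source file and a compiler that defines `__STDC_UTF_32__` (GCC does). Each array element then holds one code point, so no conversion is needed at run time. The names `test_word` and `test_word_len` are hypothetical; if the C library on the target lacks `<uchar.h>`, `uint_least32_t` from `<stdint.h>` is the same type as `char32_t`.

```c
#include <stddef.h>
#include <uchar.h>   /* char32_t (C11) */

#ifndef __STDC_UTF_32__
#error "char32_t values are not guaranteed to be UTF-32 on this compiler"
#endif

/* One element per code point, plus a terminating zero. */
static const char32_t test_word[] = U"тест";

/* Number of code points in test_word, not counting the terminator. */
static size_t test_word_len(void)
{
    return sizeof test_word / sizeof test_word[0] - 1;
}
```

Here `test_word[0]` is U+0442 (1090) for `т`, and `test_word_len()` is 4, although the same word takes 8 bytes as a UTF-8 `char` string.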
I wrote a function that converts a Cyrillic symbol to its Unicode code. I can post it here tomorrow. I also found that a Cyrillic symbol can be stored in 2 bytes, something like `char symb[2] = "Б"`. Thanks, all. — Vasiliy Platon, Mar 24 '19 at 18:10
That's not a solution either. **Use UTF-8** instead. `char symb[2] = "Б"` seems to work because Cyrillic characters are encoded using 2 bytes in UTF-8. But that'll fail miserably for a lot of other characters. `wchar_t symb = L'Б';` is better, but still not as good as UTF-8 — phuclv, Mar 25 '19 at 01:52
But what should I use instead of char to declare a string of symbols? I posted my code in an answer to my question. Everything seems to work perfectly now. — Vasiliy Platon, Mar 25 '19 at 08:10

Vasiliy Platon · Accepted Answer · 2019-03-25T08:15:22.227

1

Here is my function to convert a two-byte UTF-8 Cyrillic symbol to its Unicode code. I added a check at the end of the function. Thanks to @phuclv for his reply.

int UniCyrConv(char *str, char *unicode_code)
{
        int num1 = (unsigned char)str[0];  // first UTF-8 byte (cast keeps it in 0..255 even if char is signed)
        int num2 = (unsigned char)str[1];  // second UTF-8 byte
        int conv1 = (num1 & 31) * 64;      // drop the 3 leading bits, then shift left by 6
        int conv2 = (num2 & 63);           // drop the 2 leading bits
        int final = (conv1 | conv2);       // combine both parts into the code point
        DecToHex(final, unicode_code);     // write the code as hex (DecToHex is my helper, not shown here)
        return final;
}

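Check whether the symbol is Cyrillic (this goes at the end of the function, in place of the last two lines):

        if ((final >= 1040) && (final <= 1103)) {   // А (U+0410) .. я (U+044F)
                DecToHex(final, unicode_code);      // write the code as hex
                return final;
        }
        else { return -1; }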
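The same bit manipulation can be extended from one symbol to a whole string, which is what the question asks for. The following is only a sketch under some assumptions: the input is a NUL-terminated UTF-8 string, only 1-, 2- and 3-byte sequences are handled (4-byte sequences are reported as an error), and overlong encodings are not rejected. The name `utf8_to_codepoints` is hypothetical, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into Unicode code points.
   Writes at most max_out values to out and returns how many were
   written, or -1 on a malformed or unsupported sequence. */
static int utf8_to_codepoints(const char *s, uint32_t *out, size_t max_out)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != 0 && n < max_out) {
        uint32_t cp;

        if (p[0] < 0x80) {
            /* 0xxxxxxx: plain ASCII */
            cp = p[0];
            p += 1;
        } else if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            /* 110xxxxx 10xxxxxx: covers all Cyrillic letters */
            cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
            p += 2;
        } else if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80
                                         && (p[2] & 0xC0) == 0x80) {
            /* 1110xxxx 10xxxxxx 10xxxxxx */
            cp = ((uint32_t)(p[0] & 0x0F) << 12)
               | ((uint32_t)(p[1] & 0x3F) << 6)
               |  (uint32_t)(p[2] & 0x3F);
            p += 3;
        } else {
            return -1;  /* truncated, malformed or 4-byte sequence */
        }
        out[n++] = cp;
    }
    return (int)n;
}
```

Called on `"тест"` with an output array of at least 4 elements, it fills the array with 1090, 1077, 1089, 1090 and returns 4. The continuation-byte checks also stop the function from reading past the terminator when a string ends in the middle of a sequence.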

edited Mar 25 '19 at 08:15

answered Mar 25 '19 at 08:08

Vasiliy Platon


If this is not an answer, please edit your question and add this info there. – YesThatIsMyName Mar 25 '19 at 08:12

It is an answer. Thanks. – Vasiliy Platon Mar 25 '19 at 08:15
