Modifying a string with Russian symbols in C

Question

I have this code:

#include <stdio.h>

int main() {
  char abc[] = "Hello";
  abc[0] = 'm';
  printf("%s\n", abc);
  return 0;
}

It prints 'mello' and everything works correctly.

I have another piece of code:

#include <stdio.h>

int main() {
  char abc[] = "Привет";
  abc[0] = 'm';
  printf("%s\n", abc);
  return 0;
}

It prints 'm?ривет'. What's wrong with Russian symbols?

The string may be encoded as [UTF-8](https://en.wikipedia.org/wiki/UTF-8), in which case the Russian letters take multiple bytes. Try printing the `strlen` of `abc`. — user3386109, Mar 14 '19 at 06:39
OK, so this is really a Unix question, but I'll point you in the right direction. On your system, you need locales and languages enabled. Please see this thread on encoding: https://stackoverflow.com/questions/10017328/unicode-stored-in-c-char — BitShift, Mar 14 '19 at 06:40
And this tutorial on enabling locales in Linux: https://www.cyberciti.biz/faq/how-to-set-locales-i18n-on-a-linux-unix/ — BitShift, Mar 14 '19 at 06:41
If you are using Windows, then use `wchar_t` instead of `char` — Mayur, Mar 14 '19 at 06:45
@user3386109 Regardless of encoding, we can be sure that Russian needs more bytes. — meaning-matters, Mar 14 '19 at 07:29
Actually, if you use a Cyrillic code set such as ISO 8859-5 or MS Windows CP 1251, then each Russian character is encoded as a single byte. If you use Unicode, then the characters require more than one byte, regardless of which encoding (UTF-8, UTF-16 or UTF-32) you use. (See http://czyborra.com/charsets/iso8859.html or http://czyborra.com/charsets/codepages.html for more information.) — Jonathan Leffler, Mar 14 '19 at 07:33

chqrlie · Accepted Answer · 2019-03-14T08:08:36.803

Russian letters are encoded in UTF-8 on your system, with 2 bytes for each Cyrillic letter. You cannot change a letter by assigning to an individual `char` element of the string; you must construct a new string from substrings.

Here is a program to illustrate how the encoding works:

#include <stdio.h>
#include <string.h>

int utf8_length(const char *s) {
    // plain char may be signed, so compare the lead byte as unsigned
    unsigned char c = (unsigned char)*s;

    if (c < 128)
        return 1;   // regular ASCII byte
    if (c < 128+64)
        return -1;  // continuation byte, cannot start a code point
    if (c < 128+64+32)
        return 2;   // code-point encoded on 2 bytes
    if (c < 128+64+32+16)
        return 3;   // code-point encoded on 3 bytes
    if (c < 128+64+32+16+8)
        return 4;   // code-point encoded on 4 bytes
    return -1;      // invalid lead byte
}

void test(const char *s) {
    int len = strlen(s);
    int i, nbytes;

    printf("Hex representation of %s:\n", s);
    for (i = 0; i <= len; i++) {
        printf("%02X ", (unsigned char)s[i]);
    }
    printf("\n");
    for (i = 0; i < len; i += nbytes) {
        nbytes = utf8_length(s + i);
        if (nbytes < 0) {
            printf("invalid encoding at %d\n", i);
            nbytes = 1;  // skip the offending byte so the loop keeps advancing
        } else {
            printf("%*s%.*s ",
                   nbytes * 3 - 2 - (nbytes > 2), "",
                   nbytes, s + i);
        }
    }
    printf("\n\n");
}

int main() {
    char buf[128];
    char abc[] = "Привет";

    test("hello");  // English
    test(abc);      // Russian
    test("你好");   // Mandarin

    strcpy(buf, "m");
    strcat(buf, abc + utf8_length(abc));

    printf("modified string: %s\n", buf);
    test(buf);

    return 0;
}

Output:

Hex representation of hello:
68 65 6C 6C 6F 00
 h  e  l  l  o

Hex representation of Привет:
D0 9F D1 80 D0 B8 D0 B2 D0 B5 D1 82 00
    П     р     и     в     е     т

Hex representation of 你好:
E4 BD A0 E5 A5 BD 00
      你       好

modified string: mривет
Hex representation of mривет:
6D D1 80 D0 B8 D0 B2 D0 B5 D1 82 00
 m     р     и     в     е     т

Does this mean that when I change the letter 'П', which requires 2 bytes, to the letter 'm', which requires 1 byte, there is 1 excess byte and the system doesn't understand it? (Sorry for my English) — undefined7887, Mar 14 '19 at 08:14
Yes, when you replace just the first byte of `П` with `m`, the string is no longer correctly encoded: the byte `9F` is a continuation byte and cannot start a code-point, which is why the terminal shows a `?` (the behavior might be different on a different system). Linux systems are usually configured to use the UTF-8 encoding. Windows and Java use different encodings called UTF-16 or UCS-2, where characters use 16 bits each, which is OK for most languages. The `char` type in C almost always has 8 bits. `wchar_t` is wider and can be used for simpler indexing, but the APIs are not simple. — chqrlie, Mar 14 '19 at 08:22
