0

I have to write config info to a file on Linux, and the config info contains Chinese characters.

Instead of using wchar_t, I am just using a char array. Is this correct?

Here is my code:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

#define MSG_LEN 4096

int save_config_info(const char *path, const char *message)
{
    size_t len = strlen(message);
    FILE *fp = fopen(path, "wb");

    if (!fp)
    {
        perror("fopen");
        return -1;
    }

    /* The bytes are written exactly as they are stored in memory;
       no character set conversion happens here. */
    if (fwrite(message, 1, len, fp) != len)
    {
        perror("fwrite");
        fclose(fp);
        return -1;
    }

    fclose(fp);
    return 0;
}

int main(void)
{
    /* the config contains Chinese characters */
    char str[MSG_LEN] = "配置文件中包含中文";
    char path[PATH_MAX] = "example.txt";

    /* report failure to the shell instead of ignoring it */
    return save_config_info(path, str) == 0 ? 0 : 1;
}

If the source file is encoded as ISO-8859-1, the generated example.txt shows a run of ? characters when displayed with cat.

But if I change the source file encoding to UTF-8, everything works well.

My question is:

Is there an elegant way to deal with Chinese characters, given that I cannot guarantee the source file encoding?

I want example.txt to always look right.

[root workspace]#file fork.c
fork.c: C source, ASCII text
[root workspace]#gcc -g -o fork fork.c
[root workspace]#
[root workspace]#./fork
[root workspace]#
[root workspace]#
[root workspace]#file example.txt
example.txt: ASCII text, with no line terminators
[root workspace]#
[root workspace]#cat example.txt
?????????[root workspace]#
[root workspace]#
[root workspace]#
[root workspace]#file fork.c
fork.c: C source, UTF-8 Unicode text
[root workspace]#
[root workspace]#gcc -g -o fork fork.c
[root workspace]#./fork
[root workspace]#
[root workspace]#file example.txt
example.txt: UTF-8 Unicode text, with no line terminators
[root workspace]#cat example.txt
配置文件中包含中文[root workspace]#
phuclv
J.Doe
  • 2
    Re "*If the source code encoding is ISO8859-1*", ...then it can't possibly contain `char str[MSG_LEN] = "配置文件中包含中文";`, since no Chinese characters are found in the ISO8859-1 character set. – ikegami Sep 26 '19 at 04:58
  • *"Is there any elegant way to deal with the chinese character"* Ensure the file is UTF-8. *"I can not ensure the source file encoding."* Yeah, you can. – Max Vollmer Sep 26 '19 at 05:00
  • @MaxVollmer My colleague keeps changing the source code encoding. It drives me crazy. – J.Doe Sep 26 '19 at 05:05
  • [This seems useful](https://forum.juce.com/t/embedding-unicode-string-literals-in-your-cpp-files/12600). –  Sep 26 '19 at 05:07
  • 1
    Talk to your colleague then and find a way to make sure this doesn't happen anymore. Agreeing on consistent file encoding in projects isn't rocket science. Just make sure everyone on the team sets the encoding of the software they use to what you have agreed on as team. – Max Vollmer Sep 26 '19 at 05:11

3 Answers

2

Is there an elegant way of representing characters not found in ASCII using just ASCII characters? No.

But it is possible to do so in an inelegant way.

char str[MSG_LEN] = "\xE9\x85\x8D\xE7\xBD\xAE\xE6\x96\x87\xE4\xBB\xB6\xE4\xB8\xAD\xE5\x8C\x85\xE5\x90\xAB\xE4\xB8\xAD\xE6\x96\x87";

Of course, just like your original program, this assumes that whoever views the file (e.g. using cat) has a locale based on UTF-8.
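
One caveat with hex escapes, sketched below: a \x escape consumes every hex digit that follows it, so fixed Chinese text followed directly by a digit or a letter a-f has to be split into adjacent string literals. The names CONFIG_LABEL and config_label_len are made up for this illustration.

```c
#include <string.h>

/* "配置" (U+914D U+7F6E) as UTF-8 bytes, followed by the ASCII text "abc".
 * Written as one literal, "\xE7\xBD\xAEabc" would be parsed as a single
 * oversized escape, because 'a', 'b' and 'c' are valid hex digits.
 * Adjacent string literals are joined only after escapes are processed,
 * so splitting the literal avoids the problem.
 * CONFIG_LABEL is a hypothetical name. */
static const char CONFIG_LABEL[] = "\xE9\x85\x8D\xE7\xBD\xAE" "abc";

/* Returns the number of bytes in the label, excluding the terminator. */
size_t config_label_len(void)
{
    return strlen(CONFIG_LABEL);
}
```

The label is 9 bytes long: two Chinese characters at three bytes each, plus three ASCII letters.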

ikegami
  • I cannot convert the Chinese characters at run time. The example above just shows how I deal with Chinese characters. – J.Doe Sep 26 '19 at 05:06
  • 2
    @J.Doe, I don't know what you're talking about. I didn't say anything about doing any conversion at runtime. The above is 100% equivalent to the program you posted as-is. – ikegami Sep 26 '19 at 05:07
  • The str in my real code is not what I wrote in the example above; I have to deal with Chinese characters entered by the user. – J.Doe Sep 26 '19 at 05:52
  • 1
    @J.Doe The encoding of the file has no effect on user input. The only time the encoding of the file matters is when you have Chinese in string literals. And when you do, you can use my solution. – ikegami Sep 26 '19 at 05:55
  • I will try later. I have to read something from a database or from user input, which may contain Chinese characters, and join it with some fixed Chinese text. – J.Doe Sep 26 '19 at 06:17
  • Thanks, it works. How did you get the str array with \xE9\x85……? I got the same result with @phuclv's char str[] = u8"\u914D\u7F6E\u6587\u4EF6\u4E2D\u5305\u542B\u4E2D\u6587"; by using a website that converts characters to UTF-8 codes. – J.Doe Sep 26 '19 at 09:00
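
On the question of how to obtain the \xE9\x85 sequence: one possible approach, sketched here under the assumption that the input string is already UTF-8, is a small helper that prints each byte as a hex escape. The function name to_hex_escapes is hypothetical, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Writes the bytes of `in` into `out` as C hex escapes ("\xE9\x85...").
 * Returns the number of characters written (excluding the terminator),
 * or 0 if `out` is too small or `in` is empty.
 * to_hex_escapes is a hypothetical helper. */
size_t to_hex_escapes(const char *in, char *out, size_t out_size)
{
    size_t len = strlen(in);
    size_t needed = len * 4 + 1;   /* "\xHH" per byte plus '\0' */

    if (out_size < needed)
        return 0;

    for (size_t i = 0; i < len; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        snprintf(out + i * 4, 5, "\\x%02X", (unsigned)(unsigned char)in[i]);
    }
    out[len * 4] = '\0';
    return len * 4;
}
```

Passing a string read from a UTF-8 terminal or file gives text that can be pasted into the source as a literal.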
0

To get a UTF-8 string reliably and elegantly regardless of the source file encoding, you can add the u8 prefix (available since C11 and C++11):

char str[] = u8"\u914D\u7F6E\u6587\u4EF6\u4E2D\u5305\u542B\u4E2D\u6587";

char str[] can be changed to char8_t str[] if you use C++20.

This way you don't need to look up the encoded UTF-8 bytes yourself. When you need another encoding such as UTF-16 or UTF-32, just change the prefix and the type (u8 to u or U, and char[] to auto in C++, or to char16_t[] or char32_t[] from <uchar.h> in C). The compiler converts each \u escape to the encoding selected by the prefix, so the bytes in memory no longer depend on the source file encoding.
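
As a rough check, assuming a C11 (or later) compiler, the u8 literal can be compared against the hand-encoded bytes from the other answer. The names FROM_U8, FROM_HEX and u8_matches_hex are invented for this sketch.

```c
#include <string.h>

/* The same text written two ways. Both literals are plain ASCII in the
 * source file, so re-saving the file in another encoding cannot alter them.
 * FROM_U8 and FROM_HEX are hypothetical names. */
static const char FROM_U8[]  = u8"\u914D\u7F6E\u6587\u4EF6\u4E2D\u5305\u542B\u4E2D\u6587";
static const char FROM_HEX[] = "\xE9\x85\x8D\xE7\xBD\xAE\xE6\x96\x87\xE4\xBB\xB6"
                               "\xE4\xB8\xAD\xE5\x8C\x85\xE5\x90\xAB\xE4\xB8\xAD\xE6\x96\x87";

/* Returns 1 if the compiler encoded the u8 literal as the expected
 * UTF-8 bytes (including the terminator), 0 otherwise. */
int u8_matches_hex(void)
{
    return sizeof FROM_U8 == sizeof FROM_HEX &&
           memcmp(FROM_U8, FROM_HEX, sizeof FROM_HEX) == 0;
}
```

Each of the nine characters takes three bytes in UTF-8, so both arrays hold 27 bytes plus the terminator.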

phuclv
  • What's wrong with this? If you don't understand the `u8` prefix then it's your problem, not that the answer is wrong. – phuclv Dec 28 '20 at 15:58
-1

Instead of using wchar_t,I just using char array,Is this correct?

I'd say no. The default character set and encoding for char is implementation defined (could be EBCDIC or ASCII or UTF-8 or whatever the source file happened to use or anything else) and the default character set and encoding for wchar_t is also implementation defined (could be UTF-16LE or ...).

If you need the output to be UTF-8; then (especially for portable code) you need to ignore the random default nonsense the C compiler felt like. You should also avoid using char because whether that's signed or unsigned is implementation defined, avoid using unsigned char because there's no actual guarantee that it's 8 bits, and avoid using wchar_t (because its size is implementation defined)

Specifically (for UTF-8), I'd use uint8_t (from <stdint.h>), like:

uint8_t str[] = { 0xE9, 0x85, 0x8D, 0xE7, 0xBD, 0xAE, 0xE6, 0x96, 0x87, 0xE4, 0xBB, 0xB6,
                  0xE4, 0xB8, 0xAD, 0xE5, 0x8C, 0x85, 0xE5, 0x90, 0xAB, 0xE4, 0xB8, 0xAD,
                  0xE6, 0x96, 0x87, 0x00 };

Of course if you want the file to contain CNS-11643 (or anything else) you could do that too. You just need to find a suitable type, and find the "array of numbers of that type" (e.g. possibly by using a utility like hexdump on a text file that uses the desired character set and encoding).
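
A minimal sketch of writing such an array to a file, assuming the caller already has an open FILE *. The names CONFIG_BYTES and write_bytes are hypothetical, and the array is shortened to the first two characters to keep the example small.

```c
#include <stdint.h>
#include <stdio.h>

/* UTF-8 bytes for "配置" with a trailing 0, in the style of the array above.
 * CONFIG_BYTES is a hypothetical name. */
static const uint8_t CONFIG_BYTES[] = { 0xE9, 0x85, 0x8D, 0xE7, 0xBD, 0xAE, 0x00 };

/* Writes exactly `len` bytes to `fp`.
 * Returns 0 on success, -1 on a short write.
 * The length is passed explicitly because strlen() expects char *,
 * not uint8_t *. write_bytes is a hypothetical helper. */
int write_bytes(FILE *fp, const uint8_t *buf, size_t len)
{
    return fwrite(buf, 1, len, fp) == len ? 0 : -1;
}
```

Called as write_bytes(fp, CONFIG_BYTES, sizeof CONFIG_BYTES - 1), it leaves the trailing 0 out of the file.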

Brendan
  • I will be dealing with Chinese characters entered by the user, so I can't define str in advance because I don't know what the characters will be. – J.Doe Sep 26 '19 at 05:51
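
For the run-time case raised in this comment, a possible sketch: fixed text is kept as hex escapes so the source stays pure ASCII, and the user's text is copied byte for byte. This assumes the input already arrives as UTF-8 (for example from a UTF-8 terminal or database connection); build_message is a hypothetical helper name.

```c
#include <stdio.h>

/* Joins a fixed UTF-8 prefix ("配置", given as hex escapes) with text
 * obtained at run time. The run-time text is not converted or validated;
 * it is assumed to be UTF-8 already.
 * Returns 0 on success, -1 if the result did not fit in `out`.
 * build_message is a hypothetical helper. */
int build_message(char *out, size_t out_size, const char *user_text)
{
    int n = snprintf(out, out_size, "%s: %s",
                     "\xE9\x85\x8D\xE7\xBD\xAE", user_text);

    /* snprintf returns the length it wanted to write; a value that does
     * not fit in the buffer means the output was truncated. */
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Checking the return value matters here, because truncation in the middle of a multi-byte character would leave an invalid UTF-8 sequence in the output.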