C language sizeof, strlen and strncpy for Chinese words

Question

I haven't worked with C for a long time. I have some questions about Chinese text and strncpy.

char* testString = "你好嗎?";
sizeof(testString) => it prints out 4.
strlen(testString) => it prints out 10.

When I want to copy it to another char array, I run into a problem.

char msgArray[7]; /* This is just an example. Due to some limitation, we have limited the buffer size. */

If I want to copy the data, I need to check:

if (sizeof(testString) < sizeof(msgArray)) {
    strncpy(msgArray, testString, sizeof(msgArray));
}

This causes a problem: only part of the data is copied.

The comparison should actually have been:

if (strlen(testString) < sizeof(msgArray)) {

}
else {
   printf("too long");
}

But I don't understand why this happens.

If I want to limit the number of characters (including Unicode characters, e.g. Chinese), how should I define the array? I don't think I can use a char[] array.

Thanks a lot for all the responses.

My workaround: I finally decided to truncate the strings to fit the byte limit.

These are unicode strings, you need to use wcslen, etc. sizeof is not doing what you think, it's the size of a pointer. — simonzack, Nov 10 '15 at 07:47
`testString` is a pointer. The size of a pointer is not relevant for copying strings. You need to find the length of the array it might point to. — juanchopanza, Nov 10 '15 at 07:48
@simonzack: `wcslen` is for `wchar_t` array, not `char` array. And when it comes to `strcpy`, the underlying encoding does not matter. It is byte-per-byte copy. — Siyuan Ren, Nov 10 '15 at 07:53
@SiyuanRen Yes but what he has is essentially a `wchar_t` array. — simonzack, Nov 10 '15 at 07:54
@simonzack: No, it is not a `wchar_t` array. Depends on his compiler, it may be encoded in either utf-8 or gbk. One must use `L"你好嗎?"` to declare a `wchar_t` array literal. — Siyuan Ren, Nov 10 '15 at 07:56
`strncpy` is a terrible function at the best of times, it should be avoided — M.M, Nov 10 '15 at 08:04
the code needs to `#include` the appropriate header, then set the 'locale' to the proper value, then use the wide char functions, like `wcslen()`, which will return the length (in wide characters) of a wide char array — user3629249, Nov 12 '15 at 15:36

score 5 · Answer 1 · haccks · 2015-11-10T08:00:17.957

Pointers are not arrays. testString is a pointer, so sizeof(testString) gives the size of the pointer, not the size of the string it points to.

strlen works differently: it only works on null-terminated char arrays and string literals, and it returns the number of bytes (not characters) that precede the terminating null character.
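A minimal sketch of the difference, assuming the text is stored as UTF-8. The macro GREETING and the three helper functions are hypothetical names used only for illustration; the literal is spelled with hex escapes so the byte count does not depend on how the compiler encodes the source file.

```c
#include <stddef.h>
#include <string.h>

/* "你好嗎?" written out as UTF-8 bytes (assumption: UTF-8 is the
   encoding in use): three 3-byte characters plus an ASCII '?'. */
#define GREETING "\xE4\xBD\xA0" "\xE5\xA5\xBD" "\xE5\x97\x8E" "?"

static const char *greeting_ptr = GREETING;   /* pointer to the literal  */
static const char greeting_arr[] = GREETING;  /* array holding the bytes */

/* Size of the pointer object itself, unrelated to the string length. */
size_t size_of_pointer(void) { return sizeof greeting_ptr; }

/* Size of the array: 10 data bytes plus the terminating '\0'. */
size_t size_of_array(void) { return sizeof greeting_arr; }

/* Number of bytes before the terminating '\0'. */
size_t length_in_bytes(void) { return strlen(greeting_ptr); }
```

Here size_of_pointer() returns sizeof(char *), which is 4 on the asker's system, while length_in_bytes() returns 10 and size_of_array() returns 11. Only the last two say anything about how much room a copy needs.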

edited Nov 10 '15 at 08:00

answered Nov 10 '15 at 07:50


score 2 · Answer 2 · answered Nov 10 '15 at 08:14

The behaviour of char* testString = "你好嗎?" depends on the compiler. One way to investigate what your compiler is doing is to print the individual bytes with %d. It might be generating a UTF-8 literal.
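One possible way to do that inspection is sketched below. The helper format_bytes is a hypothetical name, not a library function; it writes every byte of a string as an unsigned decimal number into a caller-supplied buffer, which can then be printed with puts or printf.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats each byte of s as an unsigned decimal number, separated by
   single spaces, into out (capacity outsize bytes, terminator included).
   Returns how many bytes of s were formatted; stops early if out is full. */
size_t format_bytes(const char *s, char *out, size_t outsize)
{
    size_t used = 0;   /* characters written to out so far */
    size_t count = 0;  /* bytes of s formatted so far */

    if (outsize == 0)
        return 0;
    out[0] = '\0';

    while (s[count] != '\0') {
        int n = snprintf(out + used, outsize - used, "%s%d",
                         count > 0 ? " " : "", (unsigned char)s[count]);
        if (n < 0 || (size_t)n >= outsize - used) {
            out[used] = '\0';  /* drop the partially written number */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

If the result for "你好嗎?" is 228 189 160 229 165 189 229 151 142 63, the compiler emitted UTF-8. A different sequence (for example two bytes per Chinese character) would point to another execution character set such as GBK.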

In the C11 standard you may write one of the following:

char const *testString = u8"你好嗎?";   // UTF-8 encoding

or

wchar_t const *testString = L"你好嗎?"; // wide string: typically UTF-16 or UCS-4, depending on the platform

With these strings, Standard C offers no way to work with Unicode characters as such. You can only work with code points and/or C characters (code units). strlen or wcslen respectively gives the number of C characters in the string, which might not correspond to the number of glyphs displayed.

If your compiler does not comply with the latest standard (i.e. it gives errors for the above lines), then to write portable code you will need to use only ASCII in your source file.

To embed Unicode in string literals you could use "\xNN" escapes with the UTF-8 byte values.
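As an illustration, the question's string could be spelled as follows, assuming UTF-8 is the desired encoding. The names greeting_utf8 and greeting_length are hypothetical. The literal is split into adjacent pieces because a hex escape consumes every hex digit that follows it, so keeping each character in its own piece avoids accidental merging with following text.

```c
#include <stddef.h>
#include <string.h>

/* "你好嗎?" as explicit UTF-8 bytes; adjacent literals are concatenated
   by the compiler, so this is a single 10-byte string plus terminator. */
static const char greeting_utf8[] =
    "\xE4\xBD\xA0"   /* 你 U+4F60 */
    "\xE5\xA5\xBD"   /* 好 U+597D */
    "\xE5\x97\x8E"   /* 嗎 U+55CE */
    "?";

/* Length in bytes, independent of the source file's encoding. */
size_t greeting_length(void) { return strlen(greeting_utf8); }
```

With this spelling the source file stays pure ASCII, and greeting_length() returns 10 on any conforming compiler.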

In both cases your best bet is probably to use a third-party Unicode library such as ICU.
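For simple cases, the gap between C characters and Unicode code points can be illustrated without a library. The sketch below assumes the input is valid UTF-8; utf8_code_points is a hypothetical helper, not a standard function. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx).

```c
#include <stddef.h>

/* Counts code points in a null-terminated string that is assumed to be
   valid UTF-8. Every code point starts with exactly one byte that is not
   of the form 10xxxxxx, so counting those bytes counts code points. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)  /* not a continuation byte */
            n++;
    }
    return n;
}
```

For the UTF-8 form of "你好嗎?" this returns 4, where strlen returns 10. It still counts code points rather than displayed glyphs, so combining sequences would be over-counted; a library such as ICU is the safer route for that.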

For the second part of the question, I'll assume you are using UTF-8. The result of strlen(testString) + 1 is the number of bytes (C characters, including the terminating null) you need to copy. You say you are stuck with a fixed-size 7-byte buffer. If that is true then the code could be:

char buf[7];

if ( strlen(testString) > 6 )
    exit(1);   // or jump to some other error handling

strcpy(buf, testString);

strncpy should be avoided because it does not null-terminate the destination when the source is at least as long as the given count; you can always replace it with strcpy (after a length check) or snprintf.
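If truncating is acceptable instead of rejecting the string (as in the asker's workaround), the cut should not land in the middle of a multi-byte character. The following is a sketch under the assumption that the source is valid UTF-8; utf8_copy_truncate is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dst (capacity dstsize bytes, terminator included).
   If src does not fit, it is cut at the last UTF-8 character boundary
   that fits. Always null-terminates when dstsize > 0.
   Returns the number of bytes copied, excluding the terminator. */
size_t utf8_copy_truncate(char *dst, size_t dstsize, const char *src)
{
    size_t len;

    if (dstsize == 0)
        return 0;

    len = strlen(src);
    if (len > dstsize - 1) {
        len = dstsize - 1;
        /* src[len] is the first byte left out. If it is a continuation
           byte (10xxxxxx), the cut is inside a character: back up. */
        while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
            len--;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

With the 7-byte buffer from the question and the 10-byte UTF-8 string, this copies the first 6 bytes (two whole characters). With a 6-byte buffer it copies only 3 bytes, because bytes 4 and 5 would be an incomplete character.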

score 1 · Answer 3 · answered Nov 10 '15 at 07:58

Normally you can use wchar_t to represent Unicode characters (including non-English ones); each wchar_t takes 2 or 4 bytes depending on the platform. If you really want a quick way to count characters, use uint32_t (unsigned int) instead of char/wchar_t, because UTF-32 guarantees that every code point (including non-English characters) occupies exactly 4 bytes.

sizeof(testString) only gives you the size of the pointer itself, which is typically 4 on a 32-bit system and 8 on a 64-bit system.

Use wcslen to get the string length if you're using wchar_t; if you're using uint32_t, you need to write your own strlen-like function, similar to the following:

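#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uint32_t */

/* Counts the 32-bit units that precede the terminating zero. */
size_t strlenU32(const uint32_t *s) {
    const uint32_t *u = s;
    while (*u) u++;
    return (size_t)(u - s);
}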

score -1 · Answer 4 · answered Nov 10 '15 at 08:01

I am not a pro, but you may try something like this:

char* testString = "你好嗎?\0"; //null-terminating char at the end
int arr_len = 0;
while (testString[arr_len])
    arr_len++;

As a result, arr_len ends up as 10, which is the number of array elements (bytes) before the terminator; multiplying it by the size of a single char (1 byte) gives the actual length of the string in bytes.

Regards, Paweł

AdamsP

There is no need to add `'\0'` there. That is already a string literal. – ameyCU Nov 10 '15 at 08:06
