
I have a question about the usage of "strxfrm" in C.

I know that the function transforms a string according to the current locale configuration,

but I don't know what "transform" means here, or how the function performs the transformation.

For example, I tried the code below on macOS:

#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, char * argv[])
{
    char str1[512] = { 0x68, 0x6c, 0x61, 0x76, 0x61, 0x00 }; //"hlava";
    char str2[512] = { 0xc4, 0x8d, 0xc3, 0xad, 0xc5, 0xa1, 0x6e, 0xc3, 0xad, 0x6b, 0x00 }; //"číšník";
    char xfm1[512] = { '\0', };
    char xfm2[512] = { '\0', };
    char * result = NULL;
    size_t lxfm1 = 0;
    size_t lxfm2 = 0;

    result = setlocale(LC_ALL, "en_US.UTF-8");
    lxfm1 = strxfrm(xfm1, str1, sizeof xfm1);
    lxfm2 = strxfrm(xfm2, str2, sizeof xfm2);
    printf("<en-US>\n");
    printf("setlocale = \"%s\"\n", (result == NULL) ? "NULL" : result);
    printf("str1: \"%s\" --> \"%s\"\n", str1, xfm1);
    printf("str2: \"%s\" --> \"%s\"\n", str2, xfm2);
    printf("strcmp(str1, str2) = %d\n", strcmp(str1, str2));
    printf("strcmp(xfm1, xfm2) = %d\n", strcmp(xfm1, xfm2));
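    /* Note: strcoll() is correctly given the original strings (str1, str2); only the label says xfm1/xfm2. */
    printf("strcoll(xfm1, xfm2) = %d\n", strcoll(str1, str2));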
    printf("returns of strxfrm: %zu / %zu\n", lxfm1, lxfm2);

    result = setlocale(LC_ALL, "cs_CZ.UTF-8");
    lxfm1 = strxfrm(xfm1, str1, sizeof xfm1);
    lxfm2 = strxfrm(xfm2, str2, sizeof xfm2);
    printf("<cs-CZ>\n");
    printf("setlocale = \"%s\"\n", (result == NULL) ? "NULL" : result);
    printf("str1: \"%s\" --> \"%s\"\n", str1, xfm1);
    printf("str2: \"%s\" --> \"%s\"\n", str2, xfm2);
    printf("strcmp(str1, str2) = %d\n", strcmp(str1, str2));
    printf("strcmp(xfm1, xfm2) = %d\n", strcmp(xfm1, xfm2));
    printf("strcoll(xfm1, xfm2) = %d\n", strcoll(str1, str2));
    printf("returns of strxfrm: %zu / %zu\n", lxfm1, lxfm2);

    return 0;
}

I expected the result of "strcmp(xfm1, xfm2)" to be a positive integer, because the character 'č' precedes 'h' in the Czech alphabet.

However, the result is...

<en-US>
setlocale = "en_US.UTF-8"
str1: "hlava" --> "001Z001^001S001h001S0000001Z001^001S001h001S"
str2: "číšník" --> "0042003_0042001`003_001]0000008?003_009S001`003_001]"
strcmp(str1, str2) = -92
strcmp(xfm1, xfm2) = -3
strcoll(xfm1, xfm2) = -152
returns of strxfrm: 44 / 52
<cs-CZ>
setlocale = "cs_CZ.UTF-8"
str1: "hlava" --> "001Z001^001S001h001S0000001Z001^001S001h001S"
str2: "číšník" --> "0042003_0042001`003_001]0000008?003_009S001`003_001]"
strcmp(str1, str2) = -92
strcmp(xfm1, xfm2) = -3
strcoll(xfm1, xfm2) = -152
returns of strxfrm: 44 / 52

Am I misunderstanding the function 'strxfrm'? Actually, I still don't clearly understand what 'transform' means.

Please let me know the correct usage and purpose of the function.

n. m. could be an AI
Luciano Jeong
  • How does `"číšník"` work? I understand these characters are somehow encoded in ASCII in your source file. Can you post the hexadecimal encoding of the `"číšník"` string? – KamilCuk Aug 21 '18 at 07:01
  • On Ubuntu, a `strcmp()` of the transformed strings gives me a positive number, btw. Something's probably screwy in your environment. Have you done the obvious checks like making sure your source file is actually encoded using UTF-8? – Shawn Aug 21 '18 at 07:03
  • But anyway, if you change `LC_COLLATE` from `en_US.UTF-8` to `cs_CZ.UTF-8`, it's still UTF-8. UTF-8 (whether cs_CZ, en_US, or anything else) can represent all Czech characters, so nothing should change, as observed. The string `číšník` is `{ 0xc4,0x8d,0xc3,0xad,0xc5,0xa1,0x6e,0xc3,0xad,0x6b,0x00 }` in UTF-8, and in any `*.UTF-8` locale it should *cmp the same. – KamilCuk Aug 21 '18 at 07:07
  • Thank you. Then can I get an exact example of using strxfrm with standard ASCII text? I want to see how the result of strcmp differs according to the locale configuration. – Luciano Jeong Aug 21 '18 at 07:19
  • Please don't post pictures of text terminals. Post text as text. – n. m. could be an AI Aug 21 '18 at 07:43
  • This program prints positive value for `strcmp(xfm1, xfm2)` on cygwin and on ubuntu, as it should. Perhaps a Mac OS problem. Try checking that `setlocale` succeeds. – n. m. could be an AI Aug 21 '18 at 07:56
  • @KamilCuk to prove yourself wrong, create a UTF-8 encoded file with two lines. One line contains a single "e" letter and the other line contains a single "ä" letter. Then sort according to the German locale, and according to the Danish locale (`LC_ALL=de_DE.UTF-8 sort file.txt`, and `LC_ALL=da_DK.UTF-8 sort file.txt`). Observe the difference. – n. m. could be an AI Aug 21 '18 at 08:13
  • @n.m. the return values of setlocale are "en_US.UTF-8" and "cs_CZ.UTF-8", so they are not NULL. – Luciano Jeong Aug 21 '18 at 08:22
  • Can you try the test with de and dk locales from my previous comment? Is it working for you? – n. m. could be an AI Aug 21 '18 at 08:24
  • I updated the source code and the results in the question. Please take another look, everyone. – Luciano Jeong Aug 21 '18 at 08:30
  • Can you also test using `strcoll` instead of `strxfrm` and `strcmp`? I hear there could be a bug in mac os/freebsd implementation of `strxfrm`. – n. m. could be an AI Aug 21 '18 at 08:31
  • @n.m. I modified the source code and the results to use strcoll as well. Please read the question once more. The results are still the same for en-US and cs-CZ, and strcoll also returns a negative integer. – Luciano Jeong Aug 21 '18 at 08:40
  • It looks like there's indeed a bug in Mac OS X locales implementation. Google *"mac os x" collation bug*. Your code is correct, your OS is broken. – n. m. could be an AI Aug 21 '18 at 08:41
  • @n.m. Thanks. Maybe I've found the reason: it is just an OS bug. I googled the keywords you mentioned and found a solution that involves modifying some files under "/usr/share/locale/"... So should I modify the files for every language and locale (e.g. Swedish, Czech, Turkish, German, ...)? – Luciano Jeong Aug 21 '18 at 08:50
  • I'm not sure you want to modify your locale definitions. Perhaps a more sensible workaround is to use a third-party Unicode library such as ICU. – n. m. could be an AI Aug 21 '18 at 08:54
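
For the ASCII-only experiment asked about in the comments, a minimal sketch is to sort a few short strings with `qsort` and `strcoll` under different locales and look at the resulting order. The helper names `cmp_strcoll` and `sort_in_locale` are made up for this illustration, and which locale names are installed is an assumption about the system at hand.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: the elements are pointers to C strings,
 * ordered by strcoll() in the current LC_COLLATE locale. */
static int cmp_strcoll(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Hypothetical helper: sort words[0..n-1] under the named locale.
 * Returns 0 on success, or -1 if setlocale() rejects the name,
 * in which case the array is left untouched.
 * Side effect: changes the process-wide LC_COLLATE setting. */
static int sort_in_locale(const char **words, size_t n, const char *locale_name)
{
    assert(words != NULL && locale_name != NULL);
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    qsort(words, n, sizeof words[0], cmp_strcoll);
    return 0;
}
```

Sorting `{ "b", "A", "a", "B" }` with the locale name `"C"` gives plain byte order, so both uppercase letters come before both lowercase ones. With a language locale such as `"en_US.UTF-8"` on a system whose collation tables work, upper and lower case of the same letter typically end up next to each other instead.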

1 Answer


Your usage of strxfrm is correct. The problem lies in the locale implementation of Mac OS X (and FreeBSD): collation simply doesn't work properly with UTF-8 locales there, which is why strxfrm and strcoll give you identical results under en_US.UTF-8 and cs_CZ.UTF-8. It's apparently a long-standing bug/defect/inconsistency/quirk/whatever in the version of libc these operating systems use.
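
As for what "transform" means: strxfrm turns a string into a sort key, such that comparing two keys with strcmp gives a result with the same sign as comparing the original strings with strcoll in the current locale. The key is not meant to be readable, which is why the transformed strings in the question look like noise. Below is a minimal sketch of that contract. The helper names `sign_of`, `make_sort_key` and `compare_by_key` are made up for illustration, and the ordering you get is only as good as the platform's collation tables.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or +1; only the sign is meaningful. */
static int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/* Hypothetical helper: return a malloc'ed sort key for s in the current
 * LC_COLLATE locale, or NULL if allocation fails. The caller frees it. */
static char *make_sort_key(const char *s)
{
    assert(s != NULL);
    /* With a size of 0 the destination may be NULL; the return value
     * is the key length, not counting the terminating '\0'. */
    size_t need = strxfrm(NULL, s, 0);
    char *key = malloc(need + 1);
    if (key != NULL)
        strxfrm(key, s, need + 1);
    return key;
}

/* Compare a and b through their sort keys. The sign of the result should
 * match strcoll(a, b); strcoll is also the fallback if allocation fails. */
static int compare_by_key(const char *a, const char *b)
{
    char *ka = make_sort_key(a);
    char *kb = make_sort_key(b);
    int r = (ka != NULL && kb != NULL) ? strcmp(ka, kb) : strcoll(a, b);
    free(ka);
    free(kb);
    return sign_of(r);
}
```

Keys pay off when the same strings are compared many times, for example when sorting a large list: each string is transformed once, and every comparison after that is a plain strcmp. For a one-off comparison, calling strcoll directly is simpler.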

n. m. could be an AI