How to fix locale?

Question

Add the ru_RU.CP1251 locale (on Debian, uncomment ru_RU.CP1251 in /etc/locale.gen and run sudo locale-gen), then compile the following program with gcc -fexec-charset=cp1251 test.c (the input file is in UTF-8). The case check prints nothing. Only the letter 'я' is affected; other letters are classified as lowercase or uppercase just fine.

#include <locale.h>
#include <ctype.h>
#include <stdio.h>
int main (void)
{
  setlocale(LC_ALL, "ru_RU.CP1251");
  char c = 'я';
  int i;
  char z;
  for (i = 7; i >= 0; i--) {
    z = 1 << i;
    if ((z & c) == z) printf("1"); else printf("0");
  }
  printf("\n");

  if (islower(c))
    printf("lowercase\n");
  if (isupper(c))
    printf("uppercase\n");
  return 0;
}

Why do neither islower() nor isupper() work on the letter я?

Is `char` large enough to store `я`? The prototype of `islower()` suggests that `int` would be a better choice. — mouviciel, Oct 28 '16 at 05:55
`islower` and `isupper` don't work as expected with multibyte characters; take a look at [iswupper](http://pubs.opengroup.org/onlinepubs/9699919799/functions/iswupper.html) and [iswlower](http://pubs.opengroup.org/onlinepubs/9699919799/functions/iswlower.html) — David Ranieri, Oct 28 '16 at 05:59
@KeineLust I use cp1251, which is 8-bit encoding. I do not need wide characters. Try letter 'ю' - it works just fine. Only letter 'я' does not work. And I need to fix that. — Igor Liferenko, Oct 28 '16 at 06:03
@IgorLiferenko, have you tried `ru_RU.UTF-8`? If the input is UTF-8, it makes no sense to try to show it as CP1251 unless you first convert the character codes between codesets. — Luis Colorado, Nov 03 '16 at 06:37
@mouviciel: the prototype for the regular `isxxxxx()` functions has `int` as the argument type because you can pass any valid 'character coded as `unsigned char`' value or EOF, which means that one of the `char` types cannot be used as the formal argument type (because that could not accept a wide enough range of values). — Jonathan Leffler, Nov 03 '16 at 14:28
@JonathanLeffler No, `int` in the argument type is a convenience feature - in C `char` is promoted to `int`, so you do not have to do explicit typecasting. Read a thorough explanation here https://sourceware.org/bugzilla/show_bug.cgi?id=20639#c7 Besides, one never needs to pass `EOF` to any `isxxxx()` function, and use `unsigned char` instead of just `char`. — Igor Liferenko, Nov 07 '16 at 05:28
@IgorLiferenko: If the type of plain `char` is equivalent to `signed char` and if the character has the high bit set (e.g. an accented character in an 8-bit code set such as ISO 8859-15), then the plain `char` is promoted to a negative `int` and that negative value is an invalid argument to the function (macro). — Jonathan Leffler, Nov 07 '16 at 05:31
@JonathanLeffler that `int` is converted to `char` inside library functions anyway — Igor Liferenko, Nov 07 '16 at 09:42
Avoid those obsolete locales and use Unicode instead. It's much simpler and compatible with all computers — phuclv, Nov 08 '16 at 02:31
@LưuVĩnhPhúc Thank you. I use Unicode. But there **are** rare cases when one needs CP1251 encoding. But this is not the problem in question. Read the question *carefully* please. — Igor Liferenko, Nov 08 '16 at 02:33

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

Igor, if your file is UTF-8, it makes no sense to use code page 1251, as it has nothing in common with the UTF-8 encoding. Just use the ru_RU.UTF-8 locale and you'll be able to display your file without any problem. Or, if you insist on using ru_RU.CP1251, you'll first need to convert your file from UTF-8 to CP1251 (you can use the iconv(1) utility for that):

iconv --from-code=utf-8 --to-code=cp1251 your_file.txt > your_converted_file.txt

On the other hand, -fexec-charset=cp1251 only affects the character and string literals placed in the executable, and you have not specified the input charset of your source file (-finput-charset). Probably the compiler determines that from the environment (your LANG or LC_CTYPE environment variables).

Only once you control exactly which locale is used at each stage will you get coherent results.

The main reason for the effort to move everyone to a common charset (Unicode) is precisely to avoid dealing with all these locale settings at each stage.

If you always deal with documents encoded in CP1251, you'll need to use that encoding for everything on your computer, and when you receive a document encoded in UTF-8, you'll have to convert it to see it correctly.

I strongly recommend switching to UTF-8, as it is an encoding that supports every country's character set, but for now that decision is yours alone.

NOTE

On debian linux:

$ sed 's/^/    /' pru-$$.c 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <locale.h>

#define P(f,v) printf(#f"(%d /* '%c' */) => %d\n", (v), (v), f(v))
#define Q(v) do{P(isupper,(v));P(islower,(v));}while(0)

int main()
{
    setlocale(LC_ALL, "");
    Q(0xff);
}

Compiled with

$ make pru-$$
cc    pru-1342.c   -o pru-1342

execution with ru_RU.CP1251 locale

$ locale | sed 's/^/    /'
LANG=ru_RU.CP1251
LANGUAGE=
LC_CTYPE="ru_RU.CP1251"
LC_NUMERIC="ru_RU.CP1251"
LC_TIME="ru_RU.CP1251"
LC_COLLATE="ru_RU.CP1251"
LC_MONETARY="ru_RU.CP1251"
LC_MESSAGES="ru_RU.CP1251"
LC_PAPER="ru_RU.CP1251"
LC_NAME="ru_RU.CP1251"
LC_ADDRESS="ru_RU.CP1251"
LC_TELEPHONE="ru_RU.CP1251"
LC_MEASUREMENT="ru_RU.CP1251"
LC_IDENTIFICATION="ru_RU.CP1251"
LC_ALL=

$ pru-$$
isupper(255 /* 'я' */) => 0
islower(255 /* 'я' */) => 512

So, glibc is not faulty, the fault is in your code.

edited Jun 20 '20 at 09:12

Community

answered Nov 03 '16 at 06:43

Luis Colorado

Well, it depends on your software distribution, as you should change your system configuration to allow for the locale you want if it is not already configured by your system administrator. If you have it included, you can change the locale just by adjusting the environment (the `LC_*` and `LANG` variables; just dig into the `locale(1)` tool). Locales are a per-user (or per-session) configuration issue. You first need to have them included in your locale library, and then you have to select yours with environment variables. Compiling locale info is a system-dependent issue, so you need to dig into your system. – Luis Colorado Nov 07 '16 at 07:33
utf-8 is an 8-bit locale, believe it or not; you deal with 8-bit characters. By the way, you don't specify in your question that fixing the locale is what you want. Locale is managed by several incompatible packages, depending on the software distribution you have (on Macintosh, you have software packages derived from BSD; on Linux you normally have the gettext package, both with different tools and configuration info). Specify exactly what you want and I'll be able to be more specific. – Luis Colorado Nov 07 '16 at 07:39
@IgorLiferenko, yes, because adding options to the compiler only affects your code, and not the code in the libraries that do the actual input/output. But I'm not going to discuss that further. – Luis Colorado Nov 07 '16 at 10:17
In either case, adding `-fexec-charset=cp1251` doesn't make your code accept `utf-8` as native either. You don't understand that utf-8 has nothing to do with cp1251, and your file is in that encoding. You have to convert your file first, or you will not be able to do anything useful with it. – Luis Colorado Nov 07 '16 at 10:18
Posting a bug report does not mean you are right in your assertion. Do you know exactly what 0xff means in your *any* locale? Is it meaningful to ask for the uppercase of anything? I think you are completely wrong anyway. Go to parliament and file a demand. – Luis Colorado Nov 08 '16 at 06:00
for `islower()` to work in the current locale you have to `setlocale(LC_ALL, "");` *before* calling `islower()`. Just tested with `"es_ES.ISO8859-1"`, and `islower('ÿ')` (which is the iso-8859-1 value for `0xff`) returns `true`. But if you don't set the locale first it returns `false` (as specified for the `C` locale, in which only the ASCII lower case letters are lowercase) – Luis Colorado Nov 08 '16 at 06:26
Or you probably are calling them incorrectly. On my system `islower('ÿ') => 1`. Have you initialized the locale for your program with a call to `setlocale(3)` ??? (And no need to use `-fexec-charset` option on compiling) – Luis Colorado Nov 08 '16 at 06:29
no.... you can teach me how to use locales, but the only one that is teaching here is the one that has the problem using locales.... see you next time. – Luis Colorado Nov 08 '16 at 06:31
`-fexec-charset` is used when you have a different locale in your compilation environment than it is in the execution one. It's only used to translate character and string literals you have used in your compilation unit to the target locale. But it never makes library routines to behave differently, so it is of no use here, as your compilation locale is the same as your execution environment. – Luis Colorado Nov 08 '16 at 07:36
No. The compiler option affects **only** the string and char literals in your source code. Data input to your program is processed by library functions and is **never** affected by that option. Your assumption is false, so we cannot conclude (by implication or otherwise) that your locale is *any* locale. – Luis Colorado Nov 08 '16 at 07:59

score 1 · Answer 2 · edited May 23 '17 at 12:07

Jonathan Leffler's first comment on the question is correct. The isxxx() (and iswxxx()) functions are required to handle an EOF (WEOF) argument (probably to be foolproof), which is why int was chosen as the argument type. When we pass an argument of type char or a character literal, it is promoted to int, preserving the sign. Because plain char (and hence the value of a character literal) is signed by default in gcc on targets such as x86, 0xFF becomes -1, which by unhappy coincidence is the value of EOF.

Therefore, always cast explicitly when passing values of type char (and character literals with code 0xFF) to functions that take an int argument; don't count on char being unsigned, because that is implementation-defined. The cast may be written either as (unsigned char) or as (uint8_t), which is less to type (you must include stdint.h).

See also https://sourceware.org/bugzilla/show_bug.cgi?id=20792 and Why passing char as parameter to islower() does not work correctly?

score 1 · Accepted Answer · answered Nov 11 '16 at 08:14

The answer is that the encoding of the lowercase version of that character in CP1251 is decimal 255, and islower() and isupper() in your implementation do not handle that value as you pass it (as a signed char it becomes -1, which is usually the value of EOF).

You need to track down the source code for the runtime library to see what it does and why.

The solution is to write your own implementations, or wrap the ones you have. Personally, I never use these functions directly because of the many gotchas.

answered Nov 11 '16 at 08:14

david.pfx

Why is the argument type of `putchar()`, `fputc()` and `putc()` not `char`, while the argument type of `putwchar()`, `fputwc()` and `putwc()` is `wchar_t`? Also, why is a variable of type `wchar_t` passed to `iswlower()` in the example from `man mbstowcs`? This contradicts the fact that `iswlower()` takes `wint_t`. Is the example wrong? BTW, what wrappers do you use? Did you see this question? http://stackoverflow.com/questions/40601645/how-to-change-wchar-h-to-make-wchar-t-the-same-type-as-wint-t – Igor Liferenko Nov 16 '16 at 02:36
@IgorLiferenko: I always write my own wrappers. For the others, you should ask a new question. – david.pfx Nov 16 '16 at 03:00
I asked the question here: http://stackoverflow.com/questions/40626189/how-to-wrap-glibc-library-functions-to-automatically-use-unsigned-char-and-wc – Igor Liferenko Nov 16 '16 at 07:20
