I am currently working on a UTF-8 library that operates on char[]. Some characters take 1 byte, for example:

L:15 ||This is a demo.||
OriginalHex ||5468697320697320612064656d6f2e||
lower ||this is a demo.||
upper ||THIS IS A DEMO.||

and some take 2 or more bytes. The same example with different characters:

L:20||Thïs îs à démö.||
OriginalHex ||5468c3af7320c3ae7320c3a02064c3a96dc3b62e||
lower ||thïs îs à démö.||
upper ||THïS îS à DéMö.||

The characters 'ï', 'î', 'à', 'é' and 'ö' each use 2 bytes.

It seems that the C functions int tolower(int) and int toupper(int) don't work for characters of 2 or more bytes.
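
A minimal sketch of why the byte-wise approach leaves those characters alone, assuming the default "C" locale: every byte of a multi-byte UTF-8 sequence is 0x80 or above, and tolower() only maps 'A'..'Z' there. The helper names `bytewise_lower` and `utf8_count` are made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Lowercase byte by byte. In the default "C" locale only 'A'..'Z' are
   mapped; bytes >= 0x80 (all bytes of multi-byte sequences) pass through. */
static void bytewise_lower(char *s)
{
    size_t i, len;

    assert(s != NULL);
    len = strlen(s);
    for (i = 0; i < len; ++i)
        s[i] = (char)tolower((unsigned char)s[i]);
}

/* Count code points instead of bytes: every byte that is not a
   continuation byte (bit pattern 10xxxxxx) starts a new code point. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For the second demo string, strlen() reports 20 bytes while `utf8_count` reports 15 code points, which matches the L:20 shown above.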

Are there functions that can convert them to lower/upper case? Is there a function that converts accented Latin characters to 1-byte characters? For example, if the input is 'ï' or 'î', the output should be 'i'.

How can I lower/uppercase text in a non-Latin alphabet?

Cyrillic

L:19 ||Привет мир||
OriginalHex ||d09fd180d0b8d0b2d0b5d18220d0bcd0b8d180||
lower ||Привет мир||
upper ||Привет мир||

Greek

L:26 ||Γειά σου Κόσμε||
OriginalHex ||ce93ceb5ceb9ceac20cf83cebfcf8520ce9acf8ccf83cebcceb5||
lower ||Γειά σου Κόσμε||
upper ||Γειά σου Κόσμε||

Is there a solution that does not use wchar_t[]?

SMPP_lover
Does this answer your question? [How to uppercase/lowercase UTF-8 characters in C++?](https://stackoverflow.com/questions/36897781/how-to-uppercase-lowercase-utf-8-characters-in-c) Or maybe [How do I convert a UTF-8 string to upper case?](https://stackoverflow.com/questions/9929147/how-do-i-convert-a-utf-8-string-to-upper-case). – Joachim Sauer May 06 '21 at 08:28

The proper Unicode solution is to fetch the mappings from the Unicode database; for example, your greek Kappa is [U+039A](https://www.fileformat.info/info/unicode/char/039a/index.htm) which maps to [U+03BA](https://www.fileformat.info/info/unicode/char/03ba/index.htm) in the (oddly Java-labelled) lowercase mapping. You should probably also be aware of [Unicode normalization.](https://en.wikipedia.org/wiki/Unicode_equivalence) – tripleee May 06 '21 at 08:31

1 Answer

How can I lower/uppercase text in a non-Latin alphabet?

Unicode defines a constant mapping between lowercase and uppercase characters. No tricks, no calculations, just a big table. See, for example, toupper.h in libunistring or i18n_ctype in glibc.
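
As a rough sketch of the table approach without wchar_t: decode each UTF-8 sequence to a code point, look it up, and re-encode. The table `demo_upper` below is a hand-picked excerpt for illustration only; a real table would be generated from the Unicode database or taken from a library. All names here are hypothetical, and the decoder assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical excerpt of a lowercase -> uppercase table. */
struct case_pair { uint32_t lower, upper; };
static const struct case_pair demo_upper[] = {
    {0x00E0, 0x00C0}, /* à -> À */
    {0x00E9, 0x00C9}, /* é -> É */
    {0x00EE, 0x00CE}, /* î -> Î */
    {0x00EF, 0x00CF}, /* ï -> Ï */
    {0x00F6, 0x00D6}, /* ö -> Ö */
    {0x03BA, 0x039A}, /* κ -> Κ */
    {0x0438, 0x0418}, /* и -> И */
};

/* Decode one code point; returns bytes consumed. Assumes valid UTF-8. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Encode one code point into out (at least 4 bytes); returns bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Table lookup; code points without an entry are returned unchanged. */
static uint32_t demo_toupper(uint32_t cp)
{
    size_t i;

    if (cp >= 'a' && cp <= 'z')
        return cp - 0x20;
    for (i = 0; i < sizeof demo_upper / sizeof demo_upper[0]; ++i)
        if (demo_upper[i].lower == cp)
            return demo_upper[i].upper;
    return cp;
}

/* Uppercase src into dst. Returns 0 on success, -1 if dst is too small.
   A separate buffer is used because a mapping may change the byte length. */
static int utf8_upper(const char *src, char *dst, size_t dst_size)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t used = 0;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;
    while (*s) {
        uint32_t cp;
        unsigned char buf[4];
        size_t n;

        s += utf8_decode(s, &cp);
        n = utf8_encode(demo_toupper(cp), buf);
        if (used + n + 1 > dst_size)
            return -1;
        memcpy(dst + used, buf, n);
        used += n;
    }
    dst[used] = '\0';
    return 0;
}
```

Calling `utf8_upper("démö.", out, sizeof out)` with a large enough `out` would yield "DÉMÖ." with this excerpt. Lowercasing works the same way with the table reversed.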

Are there functions that can convert them to lower/upper case?

There is no forcing; some code points have no case at all.

Is there a function that converts accented Latin characters to 1-byte characters?

Glibc has a constant mapping for that.
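
To illustrate the idea of such a mapping (this is not glibc's table, just a hand-written excerpt covering the characters from the question), a small sketch that replaces known accented sequences with an ASCII base letter in place. The names `demo_fold` and `fold_to_ascii` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical excerpt: UTF-8 byte sequence -> ASCII base letter. */
struct fold { const char *utf8; char ascii; };
static const struct fold demo_fold[] = {
    {"\xc3\xa0", 'a'}, /* à */
    {"\xc3\xa9", 'e'}, /* é */
    {"\xc3\xae", 'i'}, /* î */
    {"\xc3\xaf", 'i'}, /* ï */
    {"\xc3\xb6", 'o'}, /* ö */
};

/* Fold in place. The output is never longer than the input, so the
   write pointer can never overtake the read pointer. */
static void fold_to_ascii(char *s)
{
    const size_t n = sizeof demo_fold / sizeof demo_fold[0];
    char *out = s;

    assert(s != NULL);
    while (*s) {
        size_t i;

        for (i = 0; i < n; ++i) {
            size_t len = strlen(demo_fold[i].utf8);
            if (strncmp(s, demo_fold[i].utf8, len) == 0) {
                *out++ = demo_fold[i].ascii;
                s += len;
                break;
            }
        }
        if (i == n)
            *out++ = *s++; /* no table entry: copy the byte unchanged */
    }
    *out = '\0';
}
```

With this excerpt, "Thïs îs à démö." becomes "This is a demo.". Sequences missing from the table are copied through untouched.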

KamilCuk