
Let's imagine I have a UTF-8 encoded std::string containing the following:

óó

and I'd like to convert it to the following:

ÓÓ

Ideally I want the uppercase/lowercase approach I'm using to be generic across all of UTF-8. If that's even possible.

The original byte sequence in the string is 0xc3b3c3b3 (two bytes per character, and two instances of ó) and I'd like the output to be 0xc393c393 (two instances of Ó). There are some examples on StackOverflow but they use wide character strings, and other answers say you shouldn't be using wide character strings for UTF-8. It also appears that this problem can be very "tricky" in that the output might be dependent upon the user's locale.

I was expecting to just use something like std::toupper(), but the usage is really unclear to me because it seems like I'm not just converting one character at a time but an entire string. Also, this Ideone example I put together seems to show that toupper() of 0xc3b3 is just 0xc3b3, which is an unexpected result. Calling setlocale to either UTF-8 or ISO8859-1 doesn't appear to change the outcome.
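For reference, here is a minimal reconstruction of that experiment (not the original Ideone code; `naive_upper` is a hypothetical name). It applies `std::toupper` to each byte, which is all the `<cctype>` interface can do. In the default "C" locale only `a` to `z` are mapped, and both bytes of ó (0xc3 and 0xb3) lie outside that range, so they come back unchanged. The cast to `unsigned char` is needed because passing a negative `char` value to `std::toupper` is undefined behaviour.

```cpp
#include <cctype>
#include <string>

// Apply std::toupper to every byte of the string, as a naive attempt would.
// Assumes the program is still in the default "C" locale.
std::string naive_upper(std::string s) {
    for (char& c : s) {
        // std::toupper works on a single unsigned char value, so it never
        // sees the two bytes of a UTF-8 sequence as one character.
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}
```

With this sketch, `naive_upper("abc")` yields `"ABC"`, while the bytes 0xc3 0xb3 0xc3 0xb3 pass through untouched.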

I'd love some guidance if you could shed some light on either what I'm doing wrong or why my question/premise is faulty!

aardvarkk

6 Answers


There is no standard way to do Unicode case conversion in C++. There are ways that work on some C++ implementations, but the standard doesn't require them to.

If you want guaranteed Unicode case conversion, you will need to use a library like ICU or Boost.Locale (aka: ICU with a more C++-like interface).

Nicol Bolas
  • Can you think of an instance in which the `wstring` approach mentioned in the other answer wouldn't work? – aardvarkk Apr 27 '16 at 19:00
  • @aardvarkk: If the implementation doesn't implement `en_US.UTF-8`. Also if the implementation doesn't use a Unicode format for its `wchar_t`. Nothing in the standard *guarantees* either of these. Wide-character strings are just as implementation-defined as narrow-character strings. Also, the other answer didn't use UTF-8, which was a part of your requirement. – Nicol Bolas Apr 27 '16 at 19:02
  • On Red Hat Linux, towupper() and towlower() seem to work with UTF-16, but I have failed to make the same code work with MSVC on Windows 10, and the functions do not exist in the Java/Android NDK C library. Judging by their output on Red Hat Linux, towupper() and towlower() look older than some of the rarer Unicode lower/upper characters, and a few conversions are doubtful, but the approach seems to work there (convert UTF-8 to UTF-16, lowercase, then convert back). Be careful with the locale settings. – Jan Bergström Oct 05 '21 at 13:40
  • @JanBergström: "*On Red Hat Linux, towupper() and towlower() seem to work with UTF-16*" That seems unlikely since UTF-16 is a multi-byte encoding, and `towupper/lower` only takes a single `wchar_t`. Now, on most Linux systems, `wchar_t` is actually 32 bits, and wide-strings tend to be UTF-32, not UTF-16, which is why this probably works for you. – Nicol Bolas Oct 05 '21 at 13:41
  • OK, I accept that you are right and it is 32-bit UTF-32; I use UTF-8 in general. I am running tests together with someone who contacted me about the "This code is a carefully tested UTF8 case conversion/case insensitive cmp." answer. We used the towupper/lower functions on Red Hat and MSVS (and lately Debian and macOS) to chase bugs and are very close to eliminating them all. There are some differences: with the MSVS towupper/lower functions, 101 of 2500 characters differ. Some implementations do not cover all lower/upper case pairs in Unicode, and that accounts for most of the differences. – Jan Bergström Oct 06 '21 at 14:48
  • The Red Hat and Debian towupper/lower include conversions between characters that take 2 bytes and 3 bytes in UTF-8, but the MSVS and macOS versions do not. In UTF-32 this is no problem, but when converting back to UTF-8 after towupper/lower in UTF-32 there might be some sync problems because the byte string has a different length. – Jan Bergström Oct 06 '21 at 14:49
  • @JanBergström: Also, `towupper/lower` cannot work character-by-character because sometimes the lowercase version of a character is actually *two* characters. So the very API of these functions in Unicode is just wrong-headed. – Nicol Bolas Oct 06 '21 at 15:11
  • Well, in UTF-8 there are 21 pairs that can't be converted in place because the two cases have different byte lengths; the German double-s (the "beta sign", ß) is, I believe, the most painful. See the answer "This code is a carefully tested UTF8 case conversion/case insensitive cmp.". I do not have deep knowledge of UTF-32, but it should work there. The Java/Android NDK does not support anything but UTF-8, and I write portable code that includes it, so I can't use the towupper/lower way of handling this topic. – Jan Bergström Oct 06 '21 at 15:30
  • @JanBergström: You seem to be jumping back and forth between Unicode and UTF-8. They are separate things: UTF-8 is a way for Unicode codepoints to be encoded, but case conversion is a *purely* Unicode operation. You don't do case-conversion in UTF-8; you convert some UTF-8 data (partially) into codepoints, case-convert the codepoints, and then convert them back into UTF-8 in a separate string. The case-conversion logic ought to be separate from the encoding logic. – Nicol Bolas Oct 06 '21 at 15:32
  • @Nicol Bolas: Well, the topic is UTF-8 and this answer is "There is no standard way to do Unicode case conversion in C++". The question is "How to uppercase/lowercase UTF-8 characters in C++?" and we would like to find a way to do case-insensitive string search operations. The Java/Android NDK C library only supports basic char strings, in practice UTF-8 encoded. On some operating systems there is the opportunity to convert to UTF-32, use lower/upper conversion there and convert back to UTF-8. I am just pointing out that doing so has a few side effects, such as keeping string lengths in sync. There is a need, and we should be aware of it. – Jan Bergström Oct 06 '21 at 20:59
  • @Nicol Bolas: I am checking the possibilities and working on a general UTF solution. I think it may be possible to handle sync problems with UTF-8 strings in the strstr() and strcmp() cases even when converting between characters with different byte lengths in UTF-8. That would be a clearly better general solution; I hope I get to it, it is a little exciting. Your comment is generally very accurate. – Jan Bergström Oct 10 '21 at 03:39
  • @Nicol Bolas: I tried to do something along the lines of your comment, working on Unicode code points in the lower/upper conversions and converting from/to UTF-8, which is the more proper way. The conversion tables are written by a program that reads the Unicode character specs, so they should be close to bug free. The only downside is that it behaves oddly for illegal characters above U+10FFFF (\xf4\xXX\xXX\xXX). Please try it. https://www.alphabet.se/download/UtfConv.c – Jan Bergström Oct 13 '21 at 00:58
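To illustrate the pipeline described in the comments above (decode UTF-8 into code points, case-convert the code points, re-encode into a separate string), here is a minimal sketch. All names (`decode_utf8`, `append_utf8`, `toy_upper`, `to_upper_utf8`) are hypothetical, the input is assumed to be valid UTF-8, and `toy_upper` is a deliberately tiny stand-in that only knows ASCII and the Latin-1 letters U+00E0 to U+00FE. A real mapping would come from the Unicode data, for example via ICU.

```cpp
#include <cstddef>
#include <string>

// Decode one code point starting at s[i] and advance i past it.
// Assumes valid UTF-8; there is no error recovery in this sketch.
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Append the UTF-8 encoding of cp to out (assumes cp <= 0x10FFFF).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Toy case mapping: ASCII a-z and Latin-1 U+00E0..U+00FE (except U+00F7).
// Everything else is returned unchanged.
char32_t toy_upper(char32_t cp) {
    if (cp >= U'a' && cp <= U'z') return cp - 0x20;
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return cp - 0x20;
    return cp;
}

// The encoding logic and the case logic stay separate, and the result is
// built in a new string, so differing byte lengths are not a problem.
std::string to_upper_utf8(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();)
        append_utf8(out, toy_upper(decode_utf8(in, i)));
    return out;
}
```

For the string from the question, `to_upper_utf8` turns the bytes 0xc3 0xb3 0xc3 0xb3 (óó) into 0xc3 0x93 0xc3 0x93 (ÓÓ). A one-to-one code point function like `toy_upper` still cannot express mappings where one character becomes two, which is the limitation raised in the comments.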

This code is a carefully tested UTF8 case conversion/case insensitive cmp.

It should be correct (if you find any bugs, please tell me).

These functions cover the case-sensitive character sets in UTF-8 and show how to use the conversion for comparison:

unsigned char* StrToUprExt(unsigned char* pString) (in a separate answer below, for lack of space in this one)
unsigned char* StrToLwrExt(unsigned char* pString)
int StrnCiCmp(const char* s1, const char* s2, size_t ztCount)
int StrCiCmp(const char* s1, const char* s2)
char* StrCiStr(const char* s1, const char* s2)

These characters are to be converted:

ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƊƋƎƏƐƑƓƔƖƗƘƜƝƠƢƤƧƩƬƮƯƱƲƳƵƷƸƼDŽDžLJLjNJNjǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZDzǴǶǷǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺȻȽȾɁɃɄɅɆɈɊɌɎͰͲͶͿΆΈΉΊΌΎΏΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫϏϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹϺϽϾϿЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿՀՁՂՃՄՅՆՇՈՉՊՋՌՍՎՏՐՑՒՓՔՕՖႠႡႢႣႤႥႦႧႨႩႪႫႬႭႮႯႰႱႲႳႴႵႶႷႸႹႺႻႼႽႾႿჀჁჂჃჄჅჇჍᎠᎡᎢᎣᎤᎥᎦᎧᎨᎩᎪᎫᎬᎭᎮᎯᎰᎱᎲᎳᎴᎵᎶᎷᎸᎹᎺᎻᎼᎽᎾᎿᏀᏁᏂᏃᏄᏅᏆᏇᏈᏉᏊᏋᏌᏍᏎᏏᏐᏑᏒᏓᏔᏕᏖᏗᏘᏙᏚᏛᏜᏝᏞᏟᏠᏡᏢᏣᏤᏥᏦᏧᏨᏩᏪᏫᏬᏭᏮᏯᏰᏱᏲᏳᏴᏵᲐᲑᲒᲓᲔᲕᲖᲗᲘᲙᲚᲛᲜᲝᲞᲟᲠᲡᲢᲣᲤᲥᲦᲧᲨᲩᲪᲫᲬᲭᲮᲯᲰᲱᲲᲳᲴᲵᲶᲷᲸᲹᲺᲽᲾᲿḀḂḄḆḈḊḌḎḐḒḔḖḘḚḜḞḠḢḤḦḨḪḬḮḰḲḴḶḸḺḼḾṀṂṄṆṈṊṌṎṐṒṔṖṘṚṜṞṠṢṤṦṨṪṬṮṰṲṴṶṸṺṼṾẀẂẄẆẈẊẌẎẐẒẔẞẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼẾỀỂỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪỬỮỰỲỴỶỸỺỼỾἈἉἊἋἌἍἎἏἘἙἚἛἜἝἨἩἪἫἬἭἮἯἸἹἺἻἼἽἾἿὈὉὊὋὌὍὙὛὝὟὨὩὪὫὬὭὮὯᾈᾉᾊᾋᾌᾍᾎᾏᾘᾙᾚᾛᾜᾝᾞᾟᾨᾩᾪᾫᾬᾭᾮᾯᾸᾹᾺΆᾼῈΈῊΉῌῘῙῚΊῨῩῪΎῬῸΌῺΏῼⰀⰁⰂⰃⰄⰅⰆⰇⰈⰉⰊⰋⰌⰍⰎⰏⰐⰑⰒⰓⰔⰕⰖⰗⰘⰙⰚⰛⰜⰝⰞⰟⰠⰡⰢⰣⰤⰥⰦⰧⰨⰩⰪⰫⰬⰭⰮⱠⱢⱣⱤⱧⱩⱫⱭⱮⱯⱰⱲⱵⱾⱿⲀⲂⲄⲆⲈⲊⲌⲎⲐⲒⲔⲖⲘⲚⲜⲞⲠⲢⲤⲦⲨⲪⲬⲮⲰⲲⲴⲶⲸⲺⲼⲾⳀⳂⳄⳆⳈⳊⳌⳎⳐⳒⳔⳖⳘⳚⳜⳞⳠⳢⳫⳭⳲⴀⴁⴂⴃⴄⴅⴆⴇⴈⴉⴊⴋⴌⴍⴎⴏⴐⴑⴒⴓⴔⴕⴖⴗⴘⴙⴚⴛⴜⴝⴞⴟⴠⴡⴢⴣⴤⴥⴧⴭꙀꙂꙄꙆꙈꙊꙌꙎꙐꙒꙔꙖꙘꙚꙜꙞꙠꙢꙤꙦꙨꙪꙬꚀꚂꚄꚆꚈꚊꚌꚎꚐꚒꚔꚖꚘꚚꜢꜤꜦꜨꜪꜬꜮꜲꜴꜶꜸꜺꜼꜾꝀꝂꝄꝆꝈꝊꝌꝎꝐꝒꝔꝖꝘꝚꝜꝞꝠꝢꝤꝦꝨꝪꝬꝮꝹꝻꝽꝾꞀꞂꞄꞆꞋꞍꞐꞒꞖꞘꞚꞜꞞꞠꞢꞤꞦꞨꞪꞫꞬꞭꞮꞰꞱꞲꞳꞴꞶꞸꞺꞼꞾꟂꟄꟅꟆꟇꟉꟵABCDEFGHIJKLMNOPQRSTUVWXYZ

Remarks

Letters with diacritics (umlauts and the like) are treated as letters in their own right: a and á are different. Treating them as the same in comparisons would require a far more complicated solution. Some of these characters exist only in lowercase or only in uppercase, and they are ignored.

  • There might be UTF-8 characters with lower/upper conversions that I have not yet discovered.
  • There are about a hundred lowercase and uppercase characters with no partner, which obviously cannot be converted either.
  • Of the four unicase Georgian scripts, asomtavruli, mtavruli and nuskhuri are converted to mkhedruli by StrToLwrExt(), so that texts in the same language with the same letters and content compare as equal. StrToUprExt() converts mkhedruli to mtavruli.
  • There are 21 pairs of characters where one case takes two bytes and the other takes three bytes in UTF-8. Converting them in place would risk sync failures in the strstr() functions, so they are left unconverted.

Capital = Small

0xc8 0xba = 0xe2 0xb1 0xa5

0xc8 0xbe = 0xe2 0xb1 0xa6

0xe1 0xba 0x9e = 0xc3 0x9f

0xe2 0xb1 0xa2 = 0xc9 0xab

0xe2 0xb1 0xa4 = 0xc9 0xbd

0xe2 0xb1 0xad = 0xc9 0x91

0xe2 0xb1 0xae = 0xc9 0xb1

0xe2 0xb1 0xaf = 0xc9 0x90

0xe2 0xb1 0xb0 = 0xc9 0x92

0xe2 0xb1 0xbe = 0xc8 0xbf

0xe2 0xb1 0xbf = 0xc9 0x80

0xea 0x9e 0x8d = 0xc9 0xa5

0xea 0x9e 0xaa = 0xc9 0xa6

0xea 0x9e 0xab = 0xc9 0x9c

0xea 0x9e 0xac = 0xc9 0xa1

0xea 0x9e 0xad = 0xc9 0xac

0xea 0x9e 0xae = 0xc9 0xaa

0xea 0x9e 0xb0 = 0xca 0x9e

0xea 0x9e 0xb1 = 0xca 0x87

0xea 0x9e 0xb2 = 0xca 0x9d

0xea 0x9f 0x85 = 0xca 0x82
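The length mismatch in the table above follows directly from the code point values. As a small sketch (the helper name `utf8_length` is hypothetical), the first row is U+023A (Ⱥ, bytes 0xc8 0xba) against U+2C65 (ⱥ, bytes 0xe2 0xb1 0xa5), and the third row is U+1E9E (ẞ) against U+00DF (ß):

```cpp
#include <cstddef>

// Number of bytes UTF-8 needs for a code point (assumes cp <= 0x10FFFF).
constexpr std::size_t utf8_length(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Code points below U+0800 fit in two bytes; the partner of each pair in
// the table sits at or above U+0800 and therefore needs three bytes.
static_assert(utf8_length(0x023A) == 2, "capital A with stroke: 2 bytes");
static_assert(utf8_length(0x2C65) == 3, "small a with stroke: 3 bytes");
static_assert(utf8_length(0x1E9E) == 3, "capital sharp s: 3 bytes");
static_assert(utf8_length(0x00DF) == 2, "small sharp s: 2 bytes");
```

An in-place conversion can only overwrite a character with one of the same encoded length, which is why these pairs need an output buffer of a different size if they are to be converted at all.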

The code does not handle recovery from malformed multibyte characters in the strings (a rare problem, but one specific to multibyte strings); it will resync. That is not a topic for this answer.

A buffer overrun at runtime is possible (but unlikely). It happens when a string ends with an incomplete multibyte character, which may occur if strings are cut during processing in the middle of a multibyte character. With today's plentiful memory, why not allocate memory for the complete string? Otherwise, if you want the code to be safe against buffer overruns, you need to handle that issue yourself. It is not a topic for this answer.
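One possible guard against the truncated-tail case, run before calling the conversion functions, is sketched below. It is not part of the code in this answer; the name `utf8_tail_complete` is hypothetical, and the check only inspects the end of the string rather than validating it as a whole.

```cpp
#include <cstddef>
#include <cstring>

// Returns true when the NUL-terminated string does not end in the middle of
// a multibyte sequence. Only the tail is checked, not overall validity.
bool utf8_tail_complete(const char* s) {
    std::size_t i = std::strlen(s);
    std::size_t cont = 0;
    // Step back over up to three trailing continuation bytes (10xxxxxx).
    while (i > 0 && cont < 3 &&
           (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++cont;
    }
    if (i == 0) return cont == 0;  // empty string, or continuation bytes only

    unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    std::size_t need;
    if (lead < 0x80) need = 0;                    // ASCII
    else if ((lead & 0xE0) == 0xC0) need = 1;     // two-byte sequence
    else if ((lead & 0xF0) == 0xE0) need = 2;     // three-byte sequence
    else if ((lead & 0xF8) == 0xF0) need = 3;     // four-byte sequence
    else return false;                            // stray or invalid byte
    return cont == need;
}
```

A caller could reject or trim the input when this returns false, so that a lead byte at the very end never makes the converter read past the terminator.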

unsigned char* StrToLwrExt(unsigned char* pString)
{
    unsigned char* p = pString;
    unsigned char* pExtChar = 0;

    if (pString && *pString) {
        while (*p) {
            if ((*p >= 0x41) && (*p <= 0x5a)) /* US ASCII */
                (*p) += 0x20;
            else if (*p > 0xc0) {
                pExtChar = p;
                p++;
                switch (*pExtChar) {
                case 0xc3: /* Latin 1 */
                    if ((*p >= 0x80)
                        && (*p <= 0x9e)
                        && (*p != 0x97))
                        (*p) += 0x20; /* US ASCII shift */
                    break;
                case 0xc4: /* Latin ext */
                    if (((*p >= 0x80)
                        && (*p <= 0xb7)
                        && (*p != 0xb0))
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0xb9)
                        && (*p <= 0xbe)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xbf) {
                        *pExtChar = 0xc5;
                        (*p) = 0x80;
                    }
                    break;
                case 0xc5: /* Latin ext */
                    if ((*p >= 0x81)
                        && (*p <= 0x88)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x8a)
                        && (*p <= 0xb7)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb8) {
                        *pExtChar = 0xc3;
                        (*p) = 0xbf;
                    }
                    else if ((*p >= 0xb9)
                        && (*p <= 0xbe)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xc6: /* Latin ext */
                    switch (*p) {
                    case 0x81:
                        *pExtChar = 0xc9;
                        (*p) = 0x93;
                        break;
                    case 0x86:
                        *pExtChar = 0xc9;
                        (*p) = 0x94;
                        break;
                    case 0x89:
                        *pExtChar = 0xc9;
                        (*p) = 0x96;
                        break;
                    case 0x8a:
                        *pExtChar = 0xc9;
                        (*p) = 0x97;
                        break;
                    case 0x8e:
                        *pExtChar = 0xc9;
                        (*p) = 0x98;
                        break;
                    case 0x8f:
                        *pExtChar = 0xc9;
                        (*p) = 0x99;
                        break;
                    case 0x90:
                        *pExtChar = 0xc9;
                        (*p) = 0x9b;
                        break;
                    case 0x93:
                        *pExtChar = 0xc9;
                        (*p) = 0xa0;
                        break;
                    case 0x94:
                        *pExtChar = 0xc9;
                        (*p) = 0xa3;
                        break;
                    case 0x96:
                        *pExtChar = 0xc9;
                        (*p) = 0xa9;
                        break;
                    case 0x97:
                        *pExtChar = 0xc9;
                        (*p) = 0xa8;
                        break;
                    case 0x9c:
                        *pExtChar = 0xc9;
                        (*p) = 0xaf;
                        break;
                    case 0x9d:
                        *pExtChar = 0xc9;
                        (*p) = 0xb2;
                        break;
                    case 0x9f:
                        *pExtChar = 0xc9;
                        (*p) = 0xb5;
                        break;
                    case 0xa9:
                        *pExtChar = 0xca;
                        (*p) = 0x83;
                        break;
                    case 0xae:
                        *pExtChar = 0xca;
                        (*p) = 0x88;
                        break;
                    case 0xb1:
                        *pExtChar = 0xca;
                        (*p) = 0x8a;
                        break;
                    case 0xb2:
                        *pExtChar = 0xca;
                        (*p) = 0x8b;
                        break;
                    case 0xb7:
                        *pExtChar = 0xca;
                        (*p) = 0x92;
                        break;
                    case 0x82:
                    case 0x84:
                    case 0x87:
                    case 0x8b:
                    case 0x91:
                    case 0x98:
                    case 0xa0:
                    case 0xa2:
                    case 0xa4:
                    case 0xa7:
                    case 0xac:
                    case 0xaf:
                    case 0xb3:
                    case 0xb5:
                    case 0xb8:
                    case 0xbc:
                        (*p)++; /* Next char is lwr */
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xc7: /* Latin ext */
                    if (*p == 0x84)
                        (*p) = 0x86;
                    else if (*p == 0x85)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0x87)
                        (*p) = 0x89;
                    else if (*p == 0x88)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0x8a)
                        (*p) = 0x8c;
                    else if (*p == 0x8b)
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x8d)
                        && (*p <= 0x9c)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x9e)
                        && (*p <= 0xaf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb1)
                        (*p) = 0xb3;
                    else if (*p == 0xb2)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb4)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb6) {
                        *pExtChar = 0xc6;
                        (*p) = 0x95;
                    }
                    else if (*p == 0xb7) {
                        *pExtChar = 0xc6;
                        (*p) = 0xbf;
                    }
                    else if ((*p >= 0xb8)
                        && (*p <= 0xbf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xc8: /* Latin ext */
                    if ((*p >= 0x80)
                        && (*p <= 0x9f)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xa0) {
                        *pExtChar = 0xc6;
                        (*p) = 0x9e;
                    }
                    else if ((*p >= 0xa2)
                        && (*p <= 0xb3)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xbb)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xbd) {
                        *pExtChar = 0xc6;
                        (*p) = 0x9a;
                    }
                    /* 0xba three byte small 0xe2 0xb1 0xa5 */
                    /* 0xbe three byte small 0xe2 0xb1 0xa6 */
                    break;
                case 0xc9: /* Latin ext */
                    if (*p == 0x81)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0x83) {
                        *pExtChar = 0xc6;
                        (*p) = 0x80;
                    }
                    else if (*p == 0x84) {
                        *pExtChar = 0xca;
                        (*p) = 0x89;
                    }
                    else if (*p == 0x85) {
                        *pExtChar = 0xca;
                        (*p) = 0x8c;
                    }
                    else if ((*p >= 0x86)
                        && (*p <= 0x8f)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xcd: /* Greek & Coptic */
                    switch (*p) {
                    case 0xb0:
                    case 0xb2:
                    case 0xb6:
                        (*p)++; /* Next char is lwr */
                        break;
                    case 0xbf:
                        *pExtChar = 0xcf;
                        (*p) = 0xb3;
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xce: /* Greek & Coptic */
                    if (*p == 0x86)
                        (*p) = 0xac;
                    else if (*p == 0x88)
                        (*p) = 0xad;
                    else if (*p == 0x89)
                        (*p) = 0xae;
                    else if (*p == 0x8a)
                        (*p) = 0xaf;
                    else if (*p == 0x8c) {
                        *pExtChar = 0xcf;
                        (*p) = 0x8c;
                    }
                    else if (*p == 0x8e) {
                        *pExtChar = 0xcf;
                        (*p) = 0x8d;
                    }
                    else if (*p == 0x8f) {
                        *pExtChar = 0xcf;
                        (*p) = 0x8e;
                    }
                    else if ((*p >= 0x91)
                        && (*p <= 0x9f))
                        (*p) += 0x20; /* US ASCII shift */
                    else if ((*p >= 0xa0)
                        && (*p <= 0xab)
                        && (*p != 0xa2)) {
                        *pExtChar = 0xcf;
                        (*p) -= 0x20;
                    }
                    break;
                case 0xcf: /* Greek & Coptic */
                    if (*p == 0x8f)
                        (*p) = 0x97;
                    else if ((*p >= 0x98)
                        && (*p <= 0xaf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb4) {
                        (*p) = 0x91;
                    }
                    else if (*p == 0xb7)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xb9)
                        (*p) = 0xb2;
                    else if (*p == 0xba)
                        (*p)++; /* Next char is lwr */
                    else if (*p == 0xbd) {
                        *pExtChar = 0xcd;
                        (*p) = 0xbb;
                    }
                    else if (*p == 0xbe) {
                        *pExtChar = 0xcd;
                        (*p) = 0xbc;
                    }
                    else if (*p == 0xbf) {
                        *pExtChar = 0xcd;
                        (*p) = 0xbd;
                    }
                    break;
                case 0xd0: /* Cyrillic */
                    if ((*p >= 0x80)
                        && (*p <= 0x8f)) {
                        *pExtChar = 0xd1;
                        (*p) += 0x10;
                    }
                    else if ((*p >= 0x90)
                        && (*p <= 0x9f))
                        (*p) += 0x20; /* US ASCII shift */
                    else if ((*p >= 0xa0)
                        && (*p <= 0xaf)) {
                        *pExtChar = 0xd1;
                        (*p) -= 0x20;
                    }
                    break;
                case 0xd1: /* Cyrillic supplement */
                    if ((*p >= 0xa0)
                        && (*p <= 0xbf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xd2: /* Cyrillic supplement */
                    if (*p == 0x80)
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x8a)
                        && (*p <= 0xbf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xd3: /* Cyrillic supplement */
                    if (*p == 0x80)
                        (*p) = 0x8f;
                    else if ((*p >= 0x81)
                        && (*p <= 0x8e)
                        && (*p % 2)) /* Odd */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0x90)
                        && (*p <= 0xbf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    break;
                case 0xd4: /* Cyrillic supplement & Armenian */
                    if ((*p >= 0x80)
                        && (*p <= 0xaf)
                        && (!(*p % 2))) /* Even */
                        (*p)++; /* Next char is lwr */
                    else if ((*p >= 0xb1)
                        && (*p <= 0xbf)) {
                        *pExtChar = 0xd5;
                        (*p) -= 0x10;
                    }
                    break;
                case 0xd5: /* Armenian */
                    if ((*p >= 0x80)
                        && (*p <= 0x8f)) {
                        (*p) += 0x30;
                    }
                    else if ((*p >= 0x90)
                        && (*p <= 0x96)) {
                        *pExtChar = 0xd6;
                        (*p) -= 0x10;
                    }
                    break;
                case 0xe1: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x82: /* Georgian asomtavruli */
                        if ((*p >= 0xa0)
                            && (*p <= 0xbf)) {
                            *pExtChar = 0x83;
                            (*p) -= 0x10;
                        }
                        break;
                    case 0x83: /* Georgian asomtavruli */
                        if (((*p >= 0x80)
                            && (*p <= 0x85))
                            || (*p == 0x87)
                            || (*p == 0x8d))
                            (*p) += 0x30;
                        break;
                    case 0x8e: /* Cherokee */
                        if ((*p >= 0xa0)
                            && (*p <= 0xaf)) {
                            *(p - 2) = 0xea;
                            *pExtChar = 0xad;
                            (*p) += 0x10;
                        }
                        else if ((*p >= 0xb0)
                            && (*p <= 0xbf)) {
                            *(p - 2) = 0xea;
                            *pExtChar = 0xae;
                            (*p) -= 0x30;
                        }
                        break;
                    case 0x8f: /* Cherokee */
                        if ((*p >= 0x80)
                            && (*p <= 0xaf)) {
                            *(p - 2) = 0xea;
                            *pExtChar = 0xae;
                            (*p) += 0x10;
                        }
                        else if ((*p >= 0xb0)
                            && (*p <= 0xb5)) {
                            (*p) += 0x08;
                        }
                        /* 0xbe three byte small 0xe2 0xb1 0xa6 */
                        break;
                    case 0xb2: /* Georgian mtavruli */
                        if (((*p >= 0x90)
                            && (*p <= 0xba))
                            || (*p == 0xbd)
                            || (*p == 0xbe)
                            || (*p == 0xbf))
                            *pExtChar = 0x83;
                        break;
                    case 0xb8: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xb9: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xba: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0x94)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        else if ((*p >= 0xa0)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        /* 0x9e Two byte small 0xc3 0x9f */
                        break;
                    case 0xbb: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xbc: /* Greek ex */
                        if ((*p >= 0x88)
                            && (*p <= 0x8f))
                            (*p) -= 0x08;
                        else if ((*p >= 0x98)
                            && (*p <= 0x9d))
                            (*p) -= 0x08;
                        else if ((*p >= 0xa8)
                            && (*p <= 0xaf))
                            (*p) -= 0x08;
                        else if ((*p >= 0xb8)
                            && (*p <= 0xbf))
                            (*p) -= 0x08;
                        break;
                    case 0xbd: /* Greek ex */
                        if ((*p >= 0x88)
                            && (*p <= 0x8d))
                            (*p) -= 0x08;
                        else if ((*p == 0x99)
                            || (*p == 0x9b)
                            || (*p == 0x9d)
                            || (*p == 0x9f))
                            (*p) -= 0x08;
                        else if ((*p >= 0xa8)
                            && (*p <= 0xaf))
                            (*p) -= 0x08;
                        break;
                    case 0xbe: /* Greek ex */
                        if ((*p >= 0x88)
                            && (*p <= 0x8f))
                            (*p) -= 0x08;
                        else if ((*p >= 0x98)
                            && (*p <= 0x9f))
                            (*p) -= 0x08;
                        else if ((*p >= 0xa8)
                            && (*p <= 0xaf))
                            (*p) -= 0x08;
                        else if ((*p >= 0xb8)
                            && (*p <= 0xb9))
                            (*p) -= 0x08;
                        else if ((*p >= 0xba)
                            && (*p <= 0xbb)) {
                            *(p - 1) = 0xbd;
                            (*p) -= 0x0a;
                        }
                        else if (*p == 0xbc)
                            (*p) -= 0x09;
                        break;
                    case 0xbf: /* Greek ex */
                        if ((*p >= 0x88)
                            && (*p <= 0x8b)) {
                            *(p - 1) = 0xbd;
                            (*p) += 0x2a;
                        }
                        else if (*p == 0x8c)
                            (*p) -= 0x09;
                        else if ((*p >= 0x98)
                            && (*p <= 0x99))
                            (*p) -= 0x08;
                        else if ((*p >= 0x9a)
                            && (*p <= 0x9b)) {
                            *(p - 1) = 0xbd;
                            (*p) += 0x1c;
                        }
                        else if ((*p >= 0xa8)
                            && (*p <= 0xa9))
                            (*p) -= 0x08;
                        else if ((*p >= 0xaa)
                            && (*p <= 0xab)) {
                            *(p - 1) = 0xbd;
                            (*p) += 0x10;
                        }
                        else if (*p == 0xac)
                            (*p) -= 0x07;
                        else if ((*p >= 0xb8)
                            && (*p <= 0xb9)) {
                            *(p - 1) = 0xbd;
                        }
                        else if ((*p >= 0xba)
                            && (*p <= 0xbb)) {
                            *(p - 1) = 0xbd;
                            (*p) += 0x02;
                        }
                        else if (*p == 0xbc)
                            (*p) -= 0x09;
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xe2: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0xb0: /* Glagolitic */
                        if ((*p >= 0x80)
                            && (*p <= 0x8f)) {
                            (*p) += 0x30;
                        }
                        else if ((*p >= 0x90)
                            && (*p <= 0xae)) {
                            *pExtChar = 0xb1;
                            (*p) -= 0x10;
                        }
                        break;
                    case 0xb1: /* Latin ext */
                        switch (*p) {
                        case 0xa0:
                        case 0xa7:
                        case 0xa9:
                        case 0xab:
                        case 0xb2:
                        case 0xb5:
                            (*p)++; /* Next char is lwr */
                            break;
                        case 0xa2: /* Two byte small 0xc9 0xab */
                        case 0xa4: /* Two byte small 0xc9 0xbd */
                        case 0xad: /* Two byte small 0xc9 0x91 */
                        case 0xae: /* Two byte small 0xc9 0xb1 */
                        case 0xaf: /* Two byte small 0xc9 0x90 */
                        case 0xb0: /* Two byte small 0xc9 0x92 */
                        case 0xbe: /* Two byte small 0xc8 0xbf */
                        case 0xbf: /* Two byte small 0xc9 0x80 */
                            break;
                        case 0xa3:
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0xb5;
                            *(p) = 0xbd;
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0xb2: /* Coptic */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xb3: /* Coptic */
                        if (((*p >= 0x80)
                            && (*p <= 0xa3)
                            && (!(*p % 2))) /* Even */
                            || (*p == 0xab)
                            || (*p == 0xad)
                            || (*p == 0xb2))
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0xb4: /* Georgian nuskhuri */
                        if (((*p >= 0x80)
                            && (*p <= 0xa5))
                            || (*p == 0xa7)
                            || (*p == 0xad)) {
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0x83;
                            (*p) += 0x10;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xea: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x99: /* Cyrillic */
                        if ((*p >= 0x80)
                            && (*p <= 0xad)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0x9a: /* Cyrillic */
                        if ((*p >= 0x80)
                            && (*p <= 0x9b)
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0x9c: /* Latin ext */
                        if ((((*p >= 0xa2)
                            && (*p <= 0xaf))
                            || ((*p >= 0xb2)
                                && (*p <= 0xbf)))
                            && (!(*p % 2))) /* Even */
                            (*p)++; /* Next char is lwr */
                        break;
                    case 0x9d: /* Latin ext */
                        if ((((*p >= 0x80)
                            && (*p <= 0xaf))
                            && (!(*p % 2))) /* Even */
                            || (*p == 0xb9)
                            || (*p == 0xbb)
                            || (*p == 0xbe))
                            (*p)++; /* Next char is lwr */
                        else if (*p == 0xbd) {
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0xb5;
                            *(p) = 0xb9;
                        }
                        break;
                    case 0x9e: /* Latin ext */
                        if (((((*p >= 0x80)
                            && (*p <= 0x87))
                            || ((*p >= 0x96)
                                && (*p <= 0xa9))
                            || ((*p >= 0xb4)
                                && (*p <= 0xbf)))
                            && (!(*p % 2))) /* Even */
                            || (*p == 0x8b)
                            || (*p == 0x90)
                            || (*p == 0x92))
                            (*p)++; /* Next char is lwr */
                        else if (*p == 0xb3) {
                            *(p - 2) = 0xea;
                            *(p - 1) = 0xad;
                            *(p) = 0x93;
                        }
                        /* case 0x8d: // Two byte small 0xc9 0xa5 */
                        /* case 0xaa: // Two byte small 0xc9 0xa6 */
                        /* case 0xab: // Two byte small 0xc9 0x9c */
                        /* case 0xac: // Two byte small 0xc9 0xa1 */
                        /* case 0xad: // Two byte small 0xc9 0xac */
                        /* case 0xae: // Two byte small 0xc9 0xaa */
                        /* case 0xb0: // Two byte small 0xca 0x9e */
                        /* case 0xb1: // Two byte small 0xca 0x87 */
                        /* case 0xb2: // Two byte small 0xca 0x9d */
                        break;
                    case 0x9f: /* Latin ext */
                        if ((*p == 0x82)
                            || (*p == 0x87)
                            || (*p == 0x89)
                            || (*p == 0xb5))
                            (*p)++; /* Next char is lwr */
                        else if (*p == 0x84) {
                            *(p - 2) = 0xea;
                            *(p - 1) = 0x9e;
                            *(p) = 0x94;
                        }
                        else if (*p == 0x86) {
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0xb6;
                            *(p) = 0x8e;
                        }
                        /* case 0x85: // Two byte small 0xca 0x82 */
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xef: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0xbc: /* Latin fullwidth */
                        if ((*p >= 0xa1)
                            && (*p <= 0xba)) {
                            *pExtChar = 0xbd;
                            (*p) -= 0x20;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xf0: /* Four byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x90:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0x90: /* Deseret */
                            if ((*p >= 0x80)
                                && (*p <= 0x97)) {
                                (*p) += 0x28;
                            }
                            else if ((*p >= 0x98)
                                && (*p <= 0xa7)) {
                                *pExtChar = 0x91;
                                (*p) -= 0x18;
                            }
                            break;
                        case 0x92: /* Osage  */
                            if ((*p >= 0xb0)
                                && (*p <= 0xbf)) {
                                *pExtChar = 0x93;
                                (*p) -= 0x18;
                            }
                            break;
                        case 0x93: /* Osage  */
                            if ((*p >= 0x80)
                                && (*p <= 0x93))
                                (*p) += 0x28;
                            break;
                        case 0xb2: /* Old hungarian */
                            if ((*p >= 0x80)
                                && (*p <= 0xb2))
                                *pExtChar = 0xb3;
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x91:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xa2: /* Warang citi */
                            if ((*p >= 0xa0)
                                && (*p <= 0xbf)) {
                                *pExtChar = 0xa3;
                                (*p) -= 0x20;
                            }
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x96:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xb9: /* Medefaidrin */
                            if ((*p >= 0x80)
                                && (*p <= 0x9f)) {
                                (*p) += 0x20;
                            }
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x9E:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xA4: /* Adlam */
                            if ((*p >= 0x80)
                                && (*p <= 0x9d))
                                (*p) += 0x22;
                            else if ((*p >= 0x9e)
                                && (*p <= 0xa1)) {
                                *(pExtChar) = 0xa5;
                                (*p) -= 0x1e;
                            }
                            break;
                        default:
                            break;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                default:
                    break;
                }
                pExtChar = 0;
            }
            p++;
        }
    }
    return pString;
}
int StrnCiCmp(const char* s1, const char* s2, size_t ztCount)
{
    unsigned char* pStr1Low = 0;
    unsigned char* pStr2Low = 0;
    unsigned char* p1 = 0;
    unsigned char* p2 = 0;

    if (s1 && *s1 && s2 && *s2) {
        /* Compare lower-cased copies so that the inputs are left untouched */
        pStr1Low = (unsigned char*)calloc(strlen(s1) + 1, sizeof(unsigned char));
        if (pStr1Low) {
            pStr2Low = (unsigned char*)calloc(strlen(s2) + 1, sizeof(unsigned char));
            if (pStr2Low) {
                p1 = pStr1Low;
                p2 = pStr2Low;
                strcpy((char*)pStr1Low, s1);
                strcpy((char*)pStr2Low, s2);
                StrToLwrExt(pStr1Low);
                StrToLwrExt(pStr2Low);
                for (; ztCount--; p1++, p2++) {
                    int iDiff = *p1 - *p2;
                    if (iDiff != 0 || !*p1 || !*p2) {
                        free(pStr1Low);
                        free(pStr2Low);
                        return iDiff;
                    }
                }
                free(pStr1Low);
                free(pStr2Low);
                return 0;
            }
            free(pStr1Low);
            return (-1);
        }
        return (-1);
    }
    return (-1);
}
int StrCiCmp(const char* s1, const char* s2)
{
    return StrnCiCmp(s1, s2, (size_t)(-1));
}
char* StrCiStr(const char* s1, const char* s2)
{
    char* p = (char*)s1;
    size_t len = 0;

    if (s1 && *s1 && s2 && *s2) {
        len = strlen(s2);
        while (*p) {
            if (StrnCiCmp(p, s2, len) == 0)
                return (char*)p;
            p++;
        }
    }
    return (0);
}
Jan Bergström
  • When I compile this with GCC, I get two warnings about `switch` fall-throughs (they're where the nested switch statements are) -- are they intentional? – Cygon Nov 05 '19 at 18:05
  • Second observation: I tried this code snippet with the string `u8"ÂÊÎÔÛ ÁÉÍÓÚ ÀÈÌÒÙ AÖU"` but the result was `ÂÊÎÔÛ ÁÉÍÓÚ ÀÈÌÒÙ aÖU` (16 characters are still uppercase, two are wrong). – Cygon Nov 05 '19 at 18:07
  • Do you have the hex values of the string? They could belong to other character sets with extended Latin characters within UTF-8 that were not documented (in the UTF-8 documentation I had), in which case this routine could not identify them. Without the hex codes I can't tell; if I get them I will be able to. It could also be bugs in the code. – Jan Bergström Nov 07 '19 at 00:35
  • Thanks for the correction; two breaks were missing at 0xc6 and 0xcd, and I have corrected the code. Nested? Do you mean stacked switch statements? – Jan Bergström Nov 07 '19 at 00:57
  • Basically I wanted to show how to do this. For me (for now) lwr is enough for comparing strings. It could be extended to upr if someone needs upr text, but that is the same work in reverse, and a job to do. The basic job is reading the UTF-8 specification for all character sets and making the lists. I needed comparison, for which lwr is enough, and thought it would be useful for others here. Most developers need no more than the other, simplified version, which just handles extended Latin. – Jan Bergström Nov 07 '19 at 01:10
  • 2
    I am not part of the standard committee of the C-libraries (that about all programming language libraries are based on). But I think these lists upr and lwr UTF8 conversion should be included in C-library functions, in the future because of the wide use of UTF8. But that is just my opinion, hope they consider including it. – Jan Bergström Nov 07 '19 at 01:10
  • Are there still plans to make a StrToUppExt function? – albert Feb 20 '21 at 11:48
  • My experience is that standard committees are not responding on questions in media. To find out you need to find someone that is a committee member and directly communicate. My experience is that what’s not there isn't and in C-programming you always need to make your own function library to fill your needs. Being happy what’s in the standard libs, far better than nothing. The function for Lwr I provided was due to my need of it (for comparing strings) I had to do it myself, and here I share it with others needing it. I have no need for the Upr (yet) and have other tasks to do. – Jan Bergström Feb 20 '21 at 16:07
  • 1
    I suggest you read the Lwr and make an Upr with the same logic and share it with us. Don't be shy if you publish something with a bug, there will be some trying your stuff for their own needs and check it does it right, if not report the issue. By this we get flawless additions to the ANSI C libs. That is the meaning and benefit with the Stackoverflow, helping each other, take and give. – Jan Bergström Feb 20 '21 at 16:08
  • I agree with your 2 comments. I looked at the Lwr routine for my first usage and it does an easy job (compared to the ICU library, which does not yet work for me). When compiling the Lwr function I got the message: "warning: suggest parentheses around ‘&&’ within ‘||’" so should: `case 0x83: // Georgian if ((*p >= 0x80) && ((*p <= 0x85) || (*p == 0x87)) || (*p == 0x8d)) (*p) += 0x30;` not be formulated as: `case 0x83: // Georgian if ((*p >= 0x80) && (*p <= 0x85)) || (*p == 0x87) || (*p == 0x8d)) (*p) += 0x30;` (when responding please use @albert as I'm not the first responding person) – albert Feb 20 '21 at 18:08
  • If you only need one function many commercial libs are over-intelligent and deliver more than asked for, easier to make your own function for the thing you need. Almost every C and C++ programmer has a C and a H-file with their own standard stuff solutions like this. So yes, good idea to make your own function and if you present it here we check it for bugs. Thing is the lwr function set the logic that just have to be reversed, next character has to be previous etc. It is like 2h editing a table, when done it works. If you only need it for Latin try the next answer simpler solution for Upr. – Jan Bergström Feb 21 '21 at 01:09
  • For just Latin it is indeed easy, but I need it for all the others as well, like Cyrillic, Greek, ... Any comment on my suggested fix? (though the `if ((` should be `if (((*p >= 0x80) && (*p <= 0x85)) || (*p == 0x87) || (*p == 0x8d)) (*p) += 0x30;`) – albert Feb 21 '21 at 17:39
  • Well, to do the Upr one needs to find the tables and compare the characters, and that takes time. The thing is, the code in this answer is not as simple as you suggest, because the different tables have taken the Upr/Lwr shift with different degrees of seriousness. For US-ASCII and Latin-1 it is simply `if ((*p >= 0x41) && (*p <= 0x5a)) {(*p) += 0x20;}`, but when it comes to Latin Extended you need to handle specific characters where the next one is lwr and the previous one upr. US-ASCII and Latin-1 are simple (my answer a year before) and this complete answer is longer. If you need Greek, Cyrillic etc., you must make a long version. – Jan Bergström Feb 21 '21 at 20:22
  • Very impressing though it looks like it is not 100% complete (e.g missing: 'ADLAM CAPITAL LETTER SHA' 0xF0 0x9E 0xA4 0xA1). What source did you use to compose your table? – albert Feb 24 '21 at 12:07
  • I will update the function with Adlam tonight (CET). I use the UTF-8 tables I can get over the web. (There are different updates; some are usable and some are incomplete.) We will need to update this function in the future when other alphabets are supported. Adlam has been supported in Win10 since May 2019; I published the solution in Dec 2018, when my Win10 did not support Adlam: https://en.wikipedia.org/wiki/Adlam_script – Jan Bergström Feb 24 '21 at 13:06
  • @JanBergström thanks. I think a good source is https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt Please use `@albert` so I get notified when you add a comment. – albert Feb 24 '21 at 17:40
  • @albert Adlam support of Lwr and cmp done. – Jan Bergström Feb 26 '21 at 00:56
  • Thanks a lot, not only adding Adlam but also the upper case function, impressive. – albert Feb 26 '21 at 08:31
  • Please test it and report any bugs. I have only re-edited a copy of the Lwr function, reversing the actions. I have not tested it; I have no time and no need for it myself right now. When I did Adlam I had to check how it works anyway and thought I could do the editing, because the function was in demand. I would like to have it tested in case I need it in the future. – Jan Bergström Feb 26 '21 at 15:41
  • Updated (bugs) Greek extension and to fit the size limit for an answer I cut off the talk in the answer. – Jan Bergström Sep 09 '21 at 00:47
  • converting to lower case and comparing does not perform a correct case insensitive comparison because of the nonbijective nature of the case conversion in unicode. A correct implementation has to understand how these affect the collation order. as a concrete example, based on my understanding, in a case insensitive compare "ß" should compare equal to "SS" but not to "ss". Unless I misunderstood your code, this is not the case. – Tim Seguine Sep 23 '21 at 11:38
  • There are two different levels of the topic. One is understanding of the content especially in a German case of ß etc. Someone could try? This set of function is only direct conversion of about 1500 convertible sets of Upr/Lwr case pairs. The use is for forcing a text to Lwr or Upr case but most of all to perform case insensitive cmp and strstr() operations. The 21 3/2-byte pairs and the about 100-200 no-partner characters are very rare. It means that in practice it is a working implementation for most cases and far better than without, that is the only alternative at the moment. – Jan Bergström Sep 24 '21 at 17:00
  • Because ß 0xc3 0x9f is two bytes and ẞ 0xe1 0xba 0x9e is three, they cannot be converted to Lwr or Upr without changing the length of the UTF-8 byte array. We do not make the conversion because that would be a programming horror: there would be pointer sync failures otherwise. This must be handled on an individual character basis that deals with the change in byte array length. The implementation of the Upr ẞ 0xe1 0xba 0x9e has not considered the cmp and strstr issues in case-insensitive handling. I think they can't change that, and you have to handle it in every program by itself. – Jan Bergström Sep 24 '21 at 17:34
  • @Tim Seguine: This version uses UTF Refs and as such works with the "ß". Please try it. https://www.alphabet.se/download/UtfConv.c – Jan Bergström Oct 13 '21 at 02:02
  • @albert: I tried the CaseFolding.txt and the list is not very complete and only one way to Lwr. I think my list is better, I updated this answer. But best (and code is easier to read what the pairs are) in my latest version: https://www.alphabet.se/download/UtfConv.c – Jan Bergström Oct 13 '21 at 02:06
  • @JanBergström I think you misunderstood my point. My point was: Case folding in Unicode does not preserve string length in general. Therefore implementing case insensitive comparison with case folding is not going to be correct. – Tim Seguine Oct 22 '21 at 14:25
  • @Tim Seguine: Well the string length topic is normally not a problem if the Lwr/Upr conversions are to receive a converted string. But if you do strstr operations it is a sync problem. The problem is that you must count the chars rather than the bytes to point right. But I believe I solved the sync-problem in the “This code is a function set of verified UTF (UTF8, UTF16 and UTF32) Lwr/Upr conversion …” answer below, in the https://www.alphabet.se/download/UtfConv.c file. – Jan Bergström Oct 23 '21 at 16:26
  • @Tim Seguine: The second issue of the UTF string byte length difference is the way the Upr/Lwr operation is performed. Here (as in most OS Upr/Lwr functions) we get a string that we amend with the other case letters and in the “This code is a function set of verified UTF (UTF8, UTF16 and UTF32) Lwr/Upr conversion …” answer, in the alphabet.se/download/UtfConv.c file, we make a new UTF string (that needs to be freed by the programmer later). To amend the input string with longer byte UTF-chars would risk buffer crashes and we can't realloc it because we don't know how it was created. – Jan Bergström Oct 23 '21 at 17:19
  • @JanBergström Again you misunderstand. Why are you telling me about your case folding? I know that the question this is posted on is about case folding, BUT I HAVE ABSOLUTELY NO PROBLEM with your case folding. It looks fine. You included a case insensitive comparison as an additional routine that is based on the case folding. That cannot work as intended for the reasons I mentioned, completely independently of whether the case folding is correct or not. – Tim Seguine Oct 26 '21 at 13:45
  • @Tim Seguine: I think we understand each other much better than you might believe. I understand we are talking about a concept not a problem case. The problem with the 21 pairs of Lwr/Upr UTF8 chars where the sides have different byte length is a serious obstacle/problem to deal with, risk of buffer crashes. In this answer I handle it by not converting them because I am writing in the original string. I was aware of the issue from start. – Jan Bergström Oct 29 '21 at 04:06
  • @Tim Seguine: The other answer I made "This code is a function set of verified UTF (UTF8, UTF16 and UTF32) Lwr/Upr conversion ..." instead uses the UTF32 for the conversion to a new file, and then I have a sync control for the strstr operations making it point right. Your comment made me do that solution because I think you pointed out the right way of doing it right as a concept. This here instead version is more a limited problem solver for the topic of the question here. Your comment was good. – Jan Bergström Oct 29 '21 at 04:07
  • The infamous Turkish problem: Upper-case "i" is dotted-capital-I. Lowercase "I" is dotless-lower-case i. https://ikriv.com/blog/?p=1163 – Robin Davies Jul 30 '23 at 07:36
  • The only solution that is 100% is the answer here under the title “This code is a function set of verified UTF (UTF8, UTF16 and UTF32) Lwr/Upr conversion and case-insensitive Cmp, strstr with processing using UTF code point reference ids”. There are different levels of needs and side complications. The full solution I mentioned also handles cases where the upper- and lower-case versions have different numbers of bytes, but that implies that input and output can have different byte lengths (different data in memory). If that is a problem, use the simpler solution, but it will not work with your char. – Jan Bergström Jul 31 '23 at 14:42
  • I made three different levels of solutions, depending on the needs and on the implications of different byte lengths in and out. I myself use “This code is a function set of verified UTF (UTF8, UTF16 and UTF32) Lwr/Upr conversion and case-insensitive Cmp, strstr with processing using UTF code point reference ids”, which gets everything right (as of 2021, when it was made; I might update it if there is a need, such as added case-sensitive UTF char sets). – Jan Bergström Jul 31 '23 at 14:45
6

Case-insensitive features like these are definitely needed in search facilities.

I have the same need as described above. UTF-8 is pretty smooth in most ways, but its handling of upper and lower case is not that great. It looks like it fell off the to-do list, even though it used to be one of the major items on that list in such cases. I patched the IBM keyboard driver in 1984 before IBM shipped it (copies were already available). I also patched DisplayWrite 1 and 3 (the PC-DOS word processor) before IBM wanted to ship them in Europe. I have done an awful lot of conversion between PC-DOS (CP850) or Windows (CP1252) and the national EBCDIC code pages of IBM 3270 mainframe terminal systems. They all had case sensitivity on the to-do list. All national ASCII versions and the Windows CP1252 table (but not PC-DOS CP850) keep upper and lower case 0x20 apart, so shifting between 0x40-0x5F and 0x60-0x7F flips the case.

What to do about it?

tolower() and toupper() will not work on multi-byte UTF-8 strings outside US-ASCII, because they only handle one byte at a time. A string-level solution would work, though, and there are solutions for about everything else.
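As a minimal sketch of the problem (the helper name UprBytewise is made up for illustration, and the default "C" locale is assumed), calling toupper() on each byte only changes the US-ASCII letters and leaves the bytes of a multi-byte character as they were:

```c
#include <ctype.h>
#include <stddef.h>

/* Apply toupper() to every byte in place; returns how many bytes changed.
   Hypothetical helper, assuming the default "C" locale. */
size_t UprBytewise(unsigned char *pString)
{
    size_t ztChanged = 0;
    unsigned char *p;
    for (p = pString; p && *p; p++) {
        /* *p is an unsigned char, so the argument is always valid for toupper() */
        unsigned char c = (unsigned char)toupper(*p);
        if (c != *p) {
            *p = c;
            ztChanged++;
        }
    }
    return ztChanged;
}
```

Given the bytes 'a', 0xc3, 0xb3 (that is, "aó"), only the 'a' is changed; the two bytes of ó come back untouched.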

Western Europeans are lucky

UTF-8 keeps Latin-1 (ISO 8859-1, which the Windows 8-bit CP1252 table extends) as its first additional table, the Latin-1 Supplement Unicode block, as is. This means that the letters (0xc3 0xXX) can be shifted by 0x20 just like regular US-ASCII. Code sample below.
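To see why the shift survives the encoding, here is a small sketch (the helper name EncodeUtf8TwoByte is made up for illustration) of how a code point in U+0080..U+07FF becomes two UTF-8 bytes. Every letter in U+00C0..U+00FF gets the first byte 0xc3, and the low six bits go into the second byte, so a 0x20 difference in the code point is still a 0x20 difference in the second byte:

```c
/* Encode a code point in the range U+0080..U+07FF as its two UTF-8 bytes.
   Hypothetical helper for illustration; no range checking is done. */
void EncodeUtf8TwoByte(unsigned int uCodePoint, unsigned char out[2])
{
    out[0] = (unsigned char)(0xc0 | (uCodePoint >> 6));   /* 110xxxxx: top five bits */
    out[1] = (unsigned char)(0x80 | (uCodePoint & 0x3f)); /* 10xxxxxx: low six bits */
}
```

For ó (U+00F3) this gives 0xc3 0xb3 and for Ó (U+00D3) it gives 0xc3 0x93, the byte sequences from the question.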

Greeks, Russians, Icelanders and Eastern Europeans are not that lucky

For the Icelanders, note that Đ/đ, D with stroke (U+0110/U+0111), is not the eth Ð/ð (the th sound of the word "the"). The eth is in Latin-1 and shifts like the rest, while D with stroke sits in Latin Extended-A, outside the 0xc3 table.

The Greek, Russian and Eastern European 8-bit character sets (CP1253, CP1251 and CP1257) could have been used the same way the Latin CP1252 was used directly. Then just shifting would also have worked. But instead someone filled the table pretty randomly (like in the PC-DOS 8-bit ASCII).

There is only one working solution, the same as for PC-DOS ASCII: make translation tables. I will do it by next Christmas (when I will need it badly), but here is a hint on how to do it if someone else is in a hurry.

How to build solutions for the Greeks, Russians, Icelanders and Eastern Europeans

In the code, make a separate table for each first byte of the UTF-8 sequences used by the Eastern European, Greek and Cyrillic letters. Index each table by the second byte of the letter, and at every uppercase position store the second byte of the matching lowercase letter. Then make another set of tables that goes the other way around.

Then record which first byte belongs to which table. That way the code can select the right table, read the right position and get the upper- or lower-case character it needs. Finally, modify the letter-case functions below (the ones I made for Latin-1) to use tables instead of shifting by 0x20 for those first bytes where tables must be used. It will run smoothly; new computers have no problem with the memory and processing power this takes.
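The table idea described above can be sketched as follows for one first byte, 0xd0 (basic Cyrillic). This is only an illustration under my own assumptions: the names Utf8Pair, g_aLwrD0, InitLwrD0 and StrToLwrCyrTable are made up, and each entry stores the first byte as well as the second, because some Cyrillic lower-case letters move from first byte 0xd0 to 0xd1.

```c
/* Hypothetical table entry: the two UTF-8 bytes of the lower-case letter */
typedef struct {
    unsigned char lead;
    unsigned char trail;
} Utf8Pair;

/* One table per first byte; this one is for 0xd0, indexed by (second byte - 0x80) */
static Utf8Pair g_aLwrD0[0x40];
static int g_bLwrD0Ready = 0;

/* Fill the 0xd0 table. A real table would more likely be a hand-written
   constant array; a loop is used here only to keep the sketch short. */
static void InitLwrD0(void)
{
    unsigned int i;
    for (i = 0; i < 0x40; i++) {
        unsigned char t = (unsigned char)(0x80 + i);
        g_aLwrD0[i].lead = 0xd0; /* Default: map the character to itself */
        g_aLwrD0[i].trail = t;
        if (t <= 0x8f) {         /* U+0400..U+040F -> U+0450..U+045F */
            g_aLwrD0[i].lead = 0xd1;
            g_aLwrD0[i].trail = (unsigned char)(t + 0x10);
        }
        else if (t <= 0x9f)      /* U+0410..U+041F -> U+0430..U+043F */
            g_aLwrD0[i].trail = (unsigned char)(t + 0x20);
        else if (t <= 0xaf) {    /* U+0420..U+042F -> U+0440..U+044F */
            g_aLwrD0[i].lead = 0xd1;
            g_aLwrD0[i].trail = (unsigned char)(t - 0x20);
        }
    }
    g_bLwrD0Ready = 1;
}

/* Lower-case basic Cyrillic in place by table lookup; other bytes are untouched */
unsigned char *StrToLwrCyrTable(unsigned char *pString)
{
    unsigned char *p = pString;
    if (!g_bLwrD0Ready)
        InitLwrD0();
    while (p && *p) {
        /* p[1] is at worst the terminating NUL, so reading it is safe */
        if ((p[0] == 0xd0) && (p[1] >= 0x80) && (p[1] <= 0xbf)) {
            Utf8Pair e = g_aLwrD0[p[1] - 0x80];
            p[0] = e.lead;
            p[1] = e.trail;
            p += 2;
        }
        else
            p++;
    }
    return pString;
}
```

The upper-case direction would need a second table for 0xd0 and one for 0xd1, built the same way in reverse. Because every replacement here is also two bytes long, the string can still be converted in place.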

UTF-8 letter-case functions: Latin-1 samples

I believe this works, though I have only tested it briefly. It only handles the US-ASCII and Latin-1 parts of UTF-8.

#include <string.h> /* strlen(), used by StrCiStrLatin1() */

unsigned char *StrToLwrUft8Latin1(unsigned char *pString)
{
    char cExtChar = 0; /* Non-zero while *p is the byte after a 0xc3 lead byte */
    if (pString && *pString) {
        unsigned char *p = pString;
        while (*p) {
            if (cExtChar) {
                /* 0xc3 0x80..0x9e is À..Þ, except 0x97, the multiplication sign */
                if ((*p >= 0x80) && (*p <= 0x9e) && (*p != 0x97))
                    *p += 0x20;
                cExtChar = 0;
            }
            else if ((*p >= 0x41) && (*p <= 0x5a))
                *p += 0x20; /* US-ASCII A..Z only, not @ [ \ ] ^ _ */
            else if (*p == 0xc3)
                cExtChar = 1;
            p++;
        }
    }
    return pString;
}
unsigned char *StrToUprUft8Latin1(unsigned char *pString)
{
    char cExtChar = 0; /* Non-zero while *p is the byte after a 0xc3 lead byte */
    if (pString && *pString) {
        unsigned char *p = pString;
        while (*p) {
            if (cExtChar) {
                /* 0xc3 0xa0..0xbe is à..þ, except 0xb7, the division sign */
                if ((*p >= 0xa0) && (*p <= 0xbe) && (*p != 0xb7))
                    *p -= 0x20;
                cExtChar = 0;
            }
            else if ((*p >= 0x61) && (*p <= 0x7a))
                *p -= 0x20; /* US-ASCII a..z only, not ` { | } ~ */
            else if (*p == 0xc3)
                cExtChar = 1;
            p++;
        }
    }
    return pString;
}
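/* Lower-case one byte: US-ASCII A..Z, or the byte after a 0xc3 lead byte (À..Þ) */
static unsigned char LwrByteLatin1(unsigned char c, int iAfterC3)
{
    if (iAfterC3) {
        if ((c >= 0x80) && (c <= 0x9e) && (c != 0x97))
            c += 0x20;
    }
    else if ((c >= 0x41) && (c <= 0x5a))
        c += 0x20;
    return c;
}
int StrnCiCmpLatin1(const char *s1, const char *s2, size_t ztCount)
{
    int iAfterC3 = 0; /* The strings match up to here, so one flag serves both */
    if (!s1)
        s1 = ""; /* Treat NULL as an empty string */
    if (!s2)
        s2 = "";
    for (; ztCount--; s1++, s2++) {
        unsigned char c1 = LwrByteLatin1((unsigned char)*s1, iAfterC3);
        unsigned char c2 = LwrByteLatin1((unsigned char)*s2, iAfterC3);
        if ((c1 != c2) || !c1)
            return (int)c1 - (int)c2;
        iAfterC3 = (!iAfterC3 && (c1 == 0xc3));
    }
    return 0;
}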
int StrCiCmpLatin1(const char *s1, const char *s2)
{
    return StrnCiCmpLatin1(s1, s2, (size_t)(-1));
}
char *StrCiStrLatin1(const char *s1, const char *s2)
{
    if (s1 && *s1 && s2 && *s2) {
        char *p = (char *)s1;
        size_t len = strlen(s2);
        while (*p) {
            if (StrnCiCmpLatin1(p, s2, len) == 0)
                return p;
            p++;
        }
    }
    return (0);
}
Jan Bergström
3

There are some examples on StackOverflow but they use wide character strings, and other answers say you shouldn't be using wide character strings for UTF-8.

The linked article (utf8everywhere) and those answers apply mainly to Windows. The C++ standard requires wchar_t to be wide enough to hold every character of the largest character set among the supported locales; on most Unix-like systems that means 32 bits, which works perfectly fine for text that arrives as UTF-8. On Windows, wchar_t is 16 bits and holds UTF-16, but if you're on Windows you have more problems than just that, if we're going to be honest (namely their horrifying API).
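Since the width of wchar_t is the implementation's choice, it can be worth checking what a given platform actually provides. A minimal sketch; the helper names wchar_bits, fits_in_wchar and wchar_is_unicode are made up for illustration and are not part of any library:

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Number of bits in one wchar_t on this implementation (typically 16 on
// Windows and 32 on Unix-like systems; the standard fixes neither).
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// True when a single wchar_t can hold the given code point value.
bool fits_in_wchar(char32_t cp)
{
    return cp <= static_cast<char32_t>(WCHAR_MAX);
}

// True when the implementation promises that wchar_t values are
// Unicode (ISO 10646) code points.
constexpr bool wchar_is_unicode()
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```

If fits_in_wchar(0x10FFFF) is false, a single wchar_t cannot represent every Unicode code point on that platform, and a per-element towupper() loop will not see characters outside the Basic Multilingual Plane as whole characters.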

It also appears that this problem can be very "tricky" in that the output might be dependent upon the user's locale.

Not really. Set the locale inside the code. Some programs, such as sort, don't work properly unless the locale is set in the shell, so there the onus is on the user.

I was expecting to just use something like std::toupper(), but the usage is really unclear to me because it seems like I'm not just converting one character at a time but an entire string.

The code example uses iterators. If you don't want to convert every character, don't.

Also, this Ideone example I put together seems to show that toupper() of 0xc3b3 is just 0xc3b3, which is an unexpected result. Calling setlocale to either UTF-8 or ISO8859-1 doesn't appear to change the outcome.

You have undefined behavior. toupper() takes a single character whose value must fit in an unsigned char (0 to 255); 0xc3b3 is far outside that range.
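Passing the two bytes one at a time avoids the undefined behavior, but it still does not uppercase ó, because neither 0xc3 nor 0xb3 is a letter on its own. A small sketch, assuming the program runs in the default "C" locale (toupper_bytes is a hypothetical helper name):

```cpp
#include <cctype>
#include <string>

// Apply std::toupper to every byte of s, the way a naive loop would.
std::string toupper_bytes(std::string s)
{
    for (char& c : s) {
        // Cast to unsigned char first: passing a negative char value
        // to std::toupper is undefined behavior.
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}
```

In the "C" locale this turns "abc" into "ABC" but returns the bytes 0xc3 0xb3 0xc3 0xb3 unchanged, which matches what the Ideone experiment observed.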

I'd love some guidance if you could shed some light on either what I'm doing wrong or why my question/premise is faulty!

This example works perfectly fine:

#include <clocale>   // std::setlocale
#include <cwctype>   // std::towupper
#include <iostream>
#include <string>

int main()
{
    // Select the UTF-8 enabled English locale; setlocale returns NULL
    // on systems that do not provide it.
    if (!std::setlocale(LC_CTYPE, "en_US.UTF-8"))
        return 1;

    std::wstring str = L"óó";

    std::wcout << str << std::endl;

    for (std::wstring::iterator it = str.begin(); it != str.end(); ++it)
        *it = std::towupper(*it);

    std::wcout << str << std::endl;
}

Outputs: ÓÓ

user6262916
  • Thanks for the detailed answer. In response to some of your comments: 1. The reason the usage of `toupper` was confusing to me was that I read that, in general, case mapping can turn a single-byte character into a multi-byte character. So a simple iterator wouldn't work if it was transforming characters in place. 2. Perhaps my notation for the byte sequence in the `string` was confusing. I meant to say that it was two bytes in a row: 0xc3, 0xb3. They're each valid unsigned chars. Is it safe to say that the only way this can be accomplished is to use `wstring` types? – aardvarkk Apr 27 '16 at 18:36
  • Also, is the UTF-8 locale guaranteed to be available? I'm running this code on an iPhone and I get a NULL return when I try to `setlocale` to UTF-8. – aardvarkk Apr 27 '16 at 18:39
  • And if my source for the text is UTF-8 encoded `string` (not `wstring`), is it trivial to convert from `string` to `wstring`? I think that process would also require knowledge of the UTF-8 encoding. – aardvarkk Apr 27 '16 at 18:49
  • 3
    This doesn't really answer the question as it doesn't actually convert a `std::string` containing UTF-8. – Thomas Apr 27 '16 at 20:23
  • 3
    Nothing in the C++ standard requires `wchar_t` to support UTF-32 code units (or Unicode at all). – 一二三 Apr 28 '16 at 02:58
  • There are two topics: 1.) functions handling characters of two or more bytes, and 2.) the lwr/upr functionality itself. If only US ASCII characters are used, only the first topic matters and this answer is what is needed. For other (extended) characters that have lwr and upr case versions, my answers need to be considered. The C library versions I tried only convert US ASCII to lwr or upr case. – Jan Bergström Nov 07 '19 at 01:22
  • Well, UTF32 requires a 4-byte `uint32_t`, and UTF16 a 2-byte `wchar_t` or `uint16_t`. The length of `wchar_t` varies with the OS: in MS implementations it is only 2 bytes, so `wchar_t` can't be used for UTF32 in general, though it works on UNIX systems. The Unicode Consortium is very specific about using unsigned datatypes, which is not the case for `wchar_t` in UNIX implementations, but it works with UNIX. Support for the rarer UTF16 characters varies a lot between UNIX OSes. – Jan Bergström Oct 23 '21 at 17:14
  • No, this is not locale sensitive (that is the main point of it), and Android NDK C, for instance, does not support locales. The topic is locale independent, as UTF is (that is the point of UTF, in contrast to 8-bit ASCII). – Jan Bergström Dec 17 '21 at 04:38
3

This code is a verified set of UTF (UTF8, UTF16 and UTF32) functions for Lwr/Upr conversion, case-insensitive Cmp and strstr, with the processing done on UTF code point reference ids.

Download at: https://www.alphabet.se/download/UtfConv.c

The function set is:

// Utf 8
size_t StrLenUtf8(const Utf8Char* str);
int StrnCmpUtf8(const Utf8Char* Utf8s1, const Utf8Char* Utf8s2, size_t ztCount);
int StrCmpUtf8(const Utf8Char* Utf8s1, const Utf8Char* Utf8s2);
size_t CharLenUtf8(const Utf8Char* pUtf8);
Utf8Char* ForwardUtf8Chars(const Utf8Char* pUtf8, size_t ztForwardUtf8Chars);
size_t StrLenUtf32AsUtf8(const Utf32Char* pUtf32);
Utf8Char* Utf32ToUtf8(const Utf32Char* pUtf32);
Utf32Char* Utf8ToUtf32(const Utf8Char* pUtf8);
Utf16Char* Utf8ToUtf16(const Utf8Char* pUtf8);
Utf8Char* Utf8StrMakeUprUtf8Str(const Utf8Char* pUtf8);
Utf8Char* Utf8StrMakeLwrUtf8Str(const Utf8Char* pUtf8);
int StrnCiCmpUtf8(const Utf8Char* pUtf8s1, const Utf8Char* pUtf8s2, size_t ztCount);
int StrCiCmpUtf8(const Utf8Char* pUtf8s1, const Utf8Char* pUtf8s2);
Utf8Char* StrCiStrUtf8(const Utf8Char* pUtf8s1, const Utf8Char* pUtf8s2);

// Utf 16
size_t StrLenUtf16(const Utf16Char* str);
Utf16Char* StrCpyUtf16(Utf16Char* dest, const Utf16Char* src);
Utf16Char* StrCatUtf16(Utf16Char* dest, const Utf16Char* src);
int StrnCmpUtf16(const Utf16Char* Utf16s1, const Utf16Char* Utf16s2, size_t ztCount);
int StrCmpUtf16(const Utf16Char* Utf16s1, const Utf16Char* Utf16s2);
size_t CharLenUtf16(const Utf16Char* pUtf16);
Utf16Char* ForwardUtf16Chars(const Utf16Char* pUtf16, size_t ztForwardUtf16Chars);
size_t StrLenUtf32AsUtf16(const Utf32Char* pUtf32);
Utf16Char* Utf32ToUtf16(const Utf32Char* pUtf32);
Utf32Char* Utf16ToUtf32(const Utf16Char* pUtf16);
Utf8Char* Utf16ToUtf8(const Utf16Char* pUtf16);
Utf16Char* Utf16StrMakeUprUtf16Str(const Utf16Char* pUtf16);
Utf16Char* Utf16StrMakeLwrUtf16Str(const Utf16Char* pUtf16);
int StrnCiCmpUtf16(const Utf16Char* pUtf16s1, const Utf16Char* pUtf16s2, size_t ztCount);
int StrCiCmpUtf16(const Utf16Char* pUtf16s1, const Utf16Char* pUtf16s2);
Utf16Char* StrCiStrUtf16(const Utf16Char* pUtf16s1, const Utf16Char* pUtf16s2);

// Utf 32
size_t StrLenUtf32(const Utf32Char* str);
Utf32Char* StrCpyUtf32(Utf32Char* dest, const Utf32Char* src);
Utf32Char* StrCatUtf32(Utf32Char* dest, const Utf32Char* src);
int StrnCmpUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2, size_t ztCount);
int StrCmpUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2);
Utf32Char* StrToUprUtf32(Utf32Char* pUtf32);
Utf32Char* StrToLwrUtf32(Utf32Char* pUtf32);
int StrnCiCmpUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2, size_t ztCount);
int StrCiCmpUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2);
Utf32Char* StrCiStrUtf32(const Utf32Char* Utf32s1, const Utf32Char* Utf32s2);

After reading the comments on the “This code is a carefully tested UTF8 case conversion/case insensitive cmp.” answer and other comments on the answers here, I made a solution that:

  • Converts UTF8 and UTF16 strings to UTF32 (UTF code point reference ids)
  • Processes in UTF32, Lwr/Upr-converting 1361 characters back and forth
  • Converts the UTF32 strings back to UTF8 and UTF16 strings
    • With sync control, pointing at the right character when the converted string has the same number of characters but a different number of bytes
  • Works with any datatype definitions (they differ between operating systems)
    • Define statements at the top of the code
    • Utf8Char with signed/unsigned 1 byte, 8 bit (unsigned char is default)
    • Utf16Char with at least signed/unsigned 2 bytes (wchar_t is default)
    • Utf32Char with at least signed/unsigned 4 bytes (uint32_t is default)
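
The decode, convert, re-encode pipeline described in this list can be sketched in a few lines. This is only an illustration under loud assumptions: the function names are made up, the case table covers just ASCII, Latin-1 and one three-byte example (ɐ, whose capital needs three bytes), and malformed UTF-8 is not validated. It is not the code from the linked file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode well-formed UTF-8 into code points (no validation in this sketch).
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = b < 0x80 ? 1 : b < 0xe0 ? 2 : b < 0xf0 ? 3 : 4;
        char32_t cp = len == 1 ? b : b & (0xff >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3f);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode code points back to UTF-8.
std::string encode_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        assert(cp <= 0x10FFFF);  // outside the Unicode range otherwise
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xc0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xe0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        } else {
            out += static_cast<char>(0xf0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (cp & 0x3f));
        }
    }
    return out;
}

// Toy case table working on code points, independent of the encoding.
char32_t to_upper_cp(char32_t cp)
{
    if (cp >= U'a' && cp <= U'z') return cp - 0x20;                // US ASCII
    if (cp >= 0xe0 && cp <= 0xfe && cp != 0xf7) return cp - 0x20;  // Latin-1, ÷ excluded
    if (cp == 0x250) return 0x2c6f;  // ɐ (2 bytes) -> its capital (3 bytes)
    return cp;
}

// Decode, map every code point, encode again.
std::string to_upper_utf8(const std::string& s)
{
    std::u32string cps = decode_utf8(s);
    for (char32_t& cp : cps) cp = to_upper_cp(cp);
    return encode_utf8(cps);
}
```

Because the mapping happens on code points, a result that needs more bytes than its input (0xc9 0x90 becoming 0xe2 0xb1 0xaf) requires no special handling, which an in-place byte rewrite cannot offer.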

The advantages in relation to the “This code is a carefully tested UTF8 case conversion/case insensitive cmp.” answer are:

  • It is proper programming style (separating string processing and encoding)
  • It is the full package, a complete UTF string programming tool kit
  • It also correctly handles case pairs encoded with different numbers of UTF8 or UTF16 bytes
  • Much better readability of the source code
  • Less risk of bugs, since the conversions are written by a program reading the UTF definition table
  • Applicable to the other encodings, UTF16 and UTF32

The disadvantages are:

  • There are many switch cases, which is less good for performance (not a big thing these days unless there are huge data volumes to convert)
  • The code is much longer and does not fit in a Stack Overflow answer, so a web link has to be used
Jan Bergström
  • Hello. Very nice work. Could you specify which open source license you apply to your code? I would be grateful if I could use it under the MIT license. – Jacek Dec 15 '21 at 06:13
  • To me it is a giveaway, absolutely free. I don't mind if you attribute it to me and vote up my answer here. There is a discussion about putting it on GitHub, but so far it is just a simple file, and simple is beautiful. The GitHub topic is just how to handle formalities. This answer gives you full liberty, I hope. Otherwise I would have to spend time learning all about the open source formalities, and I have other fish to fry. – Jan Bergström Dec 16 '21 at 15:54
  • This is what I use myself these days, as the third, improved and complete solution on the topic (the ones I made above are less code, but something is missing there (not wrong)). – Jan Bergström Dec 17 '21 at 04:45
1

StrToUprExt()

This answer is an extension to the “This code is a carefully tested UTF8 case conversion/case insensitive cmp.” answer above. It was made on demand, to provide an Upr function as well, even though it is not used by the strcmp() or strstr() functions.

It was made at the same time as the main answer, by collecting all the UTF8 charts with two cases (I think I found them all) and writing the code with programming assistance, so it should cover it all. It has been carefully read for bugs.

It is in a separate answer because there is a space limit on answers and the code does not fit into the other answer.
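
Much of the function below rests on regularities in the Unicode charts: in some blocks the two cases sit a fixed offset apart, and in others they alternate as neighbouring upper/lower pairs. A minimal sketch of the alternating-pair rule for the lead byte 0xc4 (U+0100..U+013F), mirroring the conditions used in that case of the function; upper_second_byte_c4 is a hypothetical helper name and not part of the answer's code:

```cpp
#include <cassert>

// Given the second byte of a two-byte UTF-8 character whose lead byte is
// 0xc4, return the second byte of its uppercase form.
unsigned char upper_second_byte_c4(unsigned char b)
{
    assert(b >= 0x80 && b <= 0xbf);  // must be a UTF-8 continuation byte
    // U+0100..U+0137: pairs are (even = upper, odd = lower); 0xb1 (dotless i)
    // is excluded, as in the answer's code.
    if (b <= 0xb7 && b != 0xb1 && (b % 2))
        return static_cast<unsigned char>(b - 1);
    // U+0139..U+013E: the pairing flips to (odd = upper, even = lower).
    if (b >= 0xb9 && b <= 0xbe && !(b % 2))
        return static_cast<unsigned char>(b - 1);
    return b;  // already uppercase, or has no in-block capital
}
```

For example, ā (0xc4 0x81) becomes Ā (0xc4 0x80), and ĺ (0xc4 0xba) becomes Ĺ (0xc4 0xb9), while 0xc4 0xb8 (ĸ) is left alone.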

unsigned char* StrToUprExt(unsigned char* pString)
{
    unsigned char* p = pString;
    unsigned char* pExtChar = 0;

    if (pString && *pString) {
        while (*p) {
            if ((*p >= 0x61) && (*p <= 0x7a)) /* US ASCII */
                (*p) -= 0x20;
            else if (*p > 0xc0) {
                pExtChar = p;
                p++;
                switch (*pExtChar) {
                case 0xc3: /* Latin 1 */
                    /* 0x9f Three byte capital 0xe1 0xba 0x9e */
                    if ((*p >= 0xa0)
                        && (*p <= 0xbe)
                        && (*p != 0xb7))
                        (*p) -= 0x20; /* US ASCII shift */
                    else if (*p == 0xbf) {
                        *pExtChar = 0xc5;
                        (*p) = 0xb8;
                    }
                    break;
                case 0xc4: /* Latin ext */
                    if (((*p >= 0x80)
                        && (*p <= 0xb7)
                        && (*p != 0xb1))
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    else if ((*p >= 0xb9)
                        && (*p <= 0xbe)
                        && (!(*p % 2))) /* Even */
                        (*p)--; /* Prev char is upr */
                    break;
                case 0xc5: /* Latin ext */
                    if (*p == 0x80) {
                        *pExtChar = 0xc4;
                        (*p) = 0xbf;
                    }
                    else if ((*p >= 0x81)
                        && (*p <= 0x88)
                        && (!(*p % 2))) /* Even */
                        (*p)--; /* Prev char is upr */
                    else if ((*p >= 0x8a)
                        && (*p <= 0xb7)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0xb8) {
                        *pExtChar = 0xc5;
                        (*p) = 0xb8;
                    }
                    else if ((*p >= 0xb9)
                        && (*p <= 0xbe)
                        && (!(*p % 2))) /* Even */
                        (*p)--; /* Prev char is upr */
                    break;
                case 0xc6: /* Latin ext */
                    switch (*p) {
                    case 0x83:
                    case 0x85:
                    case 0x88:
                    case 0x8c:
                    case 0x92:
                    case 0x99:
                    case 0xa1:
                    case 0xa3:
                    case 0xa5:
                    case 0xa8:
                    case 0xad:
                    case 0xb0:
                    case 0xb4:
                    case 0xb6:
                    case 0xb9:
                    case 0xbd:
                        (*p)--; /* Prev char is upr */
                        break;
                    case 0x80:
                        *pExtChar = 0xc9;
                        (*p) = 0x83;
                        break;
                    case 0x95:
                        *pExtChar = 0xc7;
                        (*p) = 0xb6;
                        break;
                    case 0x9a:
                        *pExtChar = 0xc8;
                        (*p) = 0xbd;
                        break;
                    case 0x9e:
                        *pExtChar = 0xc8;
                        (*p) = 0xa0;
                        break;
                    case 0xbf:
                        *pExtChar = 0xc7;
                        (*p) = 0xb7;
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xc7: /* Latin ext */
                    if (*p == 0x85)
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0x86)
                        (*p) = 0x84;
                    else if (*p == 0x88)
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0x89)
                        (*p) = 0x87;
                    else if (*p == 0x8b)
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0x8c)
                        (*p) = 0x8a;
                    else if ((*p >= 0x8d)
                        && (*p <= 0x9c)
                        && (!(*p % 2))) /* Even */
                        (*p)--; /* Prev char is upr */
                    else if ((*p >= 0x9e)
                        && (*p <= 0xaf)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0xb2)
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0xb3)
                        (*p) = 0xb1;
                    else if (*p == 0xb5)
                        (*p)--; /* Prev char is upr */
                    else if ((*p >= 0xb9)
                        && (*p <= 0xbf)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    break;
                case 0xc8: /* Latin ext */
                    if ((*p >= 0x80)
                        && (*p <= 0x9f)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    else if ((*p >= 0xa2)
                        && (*p <= 0xb3)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0xbc)
                        (*p)--; /* Prev char is upr */
                    /* 0xbf Three byte capital 0xe2 0xb1 0xbe */
                    break;
                case 0xc9: /* Latin ext */
                    switch (*p) {
                    case 0x80: /* Three byte capital 0xe2 0xb1 0xbf */
                    case 0x90: /* Three byte capital 0xe2 0xb1 0xaf */
                    case 0x91: /* Three byte capital 0xe2 0xb1 0xad */
                    case 0x92: /* Three byte capital 0xe2 0xb1 0xb0 */
                    case 0x9c: /* Three byte capital 0xea 0x9e 0xab */
                    case 0xa1: /* Three byte capital 0xea 0x9e 0xac */
                    case 0xa5: /* Three byte capital 0xea 0x9e 0x8d */
                    case 0xa6: /* Three byte capital 0xea 0x9e 0xaa */
                    case 0xab: /* Three byte capital 0xe2 0xb1 0xa2 */
                    case 0xac: /* Three byte capital 0xea 0x9e 0xad */
                    case 0xb1: /* Three byte capital 0xe2 0xb1 0xae */
                    case 0xbd: /* Three byte capital 0xe2 0xb1 0xa4 */
                        break;
                    case 0x82:
                        (*p)--; /* Prev char is upr */
                        break;
                    case 0x93:
                        *pExtChar = 0xc6;
                        (*p) = 0x81;
                        break;
                    case 0x94:
                        *pExtChar = 0xc6;
                        (*p) = 0x86;
                        break;
                    case 0x96:
                        *pExtChar = 0xc6;
                        (*p) = 0x89;
                        break;
                    case 0x97:
                        *pExtChar = 0xc6;
                        (*p) = 0x8a;
                        break;
                    case 0x98:
                        *pExtChar = 0xc6;
                        (*p) = 0x8e;
                        break;
                    case 0x99:
                        *pExtChar = 0xc6;
                        (*p) = 0x8f;
                        break;
                    case 0x9b:
                        *pExtChar = 0xc6;
                        (*p) = 0x90;
                        break;
                    case 0xa0:
                        *pExtChar = 0xc6;
                        (*p) = 0x93;
                        break;
                    case 0xa3:
                        *pExtChar = 0xc6;
                        (*p) = 0x94;
                        break;
                    case 0xa8:
                        *pExtChar = 0xc6;
                        (*p) = 0x97;
                        break;
                    case 0xa9:
                        *pExtChar = 0xc6;
                        (*p) = 0x96;
                        break;
                    case 0xaf:
                        *pExtChar = 0xc6;
                        (*p) = 0x9c;
                        break;
                    case 0xb2:
                        *pExtChar = 0xc6;
                        (*p) = 0x9d;
                        break;
                    case 0xb5:
                        *pExtChar = 0xc6;
                        (*p) = 0x9f;
                        break;
                    default:
                        if ((*p >= 0x87)
                            && (*p <= 0x8f)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    }
                    break;

                case 0xca: /* Latin ext */
                    switch (*p) {
                    case 0x82: /* Three byte capital 0xea 0x9f 0x85 */
                    case 0x87: /* Three byte capital 0xea 0x9e 0xb1 */
                    case 0x9d: /* Three byte capital 0xea 0x9e 0xb2 */
                    case 0x9e: /* Three byte capital 0xea 0x9e 0xb0 */
                        break;
                    case 0x83:
                        *pExtChar = 0xc6;
                        (*p) = 0xa9;
                        break;
                    case 0x88:
                        *pExtChar = 0xc6;
                        (*p) = 0xae;
                        break;
                    case 0x89:
                        *pExtChar = 0xc9;
                        (*p) = 0x84;
                        break;
                    case 0x8a:
                        *pExtChar = 0xc6;
                        (*p) = 0xb1;
                        break;
                    case 0x8b:
                        *pExtChar = 0xc6;
                        (*p) = 0xb2;
                        break;
                    case 0x8c:
                        *pExtChar = 0xc9;
                        (*p) = 0x85;
                        break;
                    case 0x92:
                        *pExtChar = 0xc6;
                        (*p) = 0xb7;
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xcd: /* Greek & Coptic */
                    switch (*p) {
                    case 0xb1:
                    case 0xb3:
                    case 0xb7:
                        (*p)--; /* Prev char is upr */
                        break;
                    case 0xbb:
                        *pExtChar = 0xcf;
                        (*p) = 0xbd;
                        break;
                    case 0xbc:
                        *pExtChar = 0xcf;
                        (*p) = 0xbe;
                        break;
                    case 0xbd:
                        *pExtChar = 0xcf;
                        (*p) = 0xbf;
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xce: /* Greek & Coptic */
                    if (*p == 0xac)
                        (*p) = 0x86;
                    else if (*p == 0xad)
                        (*p) = 0x88;
                    else if (*p == 0xae)
                        (*p) = 0x89;
                    else if (*p == 0xaf)
                        (*p) = 0x8a;
                    else if ((*p >= 0xb1)
                        && (*p <= 0xbf))
                        (*p) -= 0x20; /* US ASCII shift */
                    break;
                case 0xcf: /* Greek & Coptic */
                    if (*p == 0x82) {
                        *pExtChar = 0xce;
                        (*p) = 0xa3;
                    }
                    else if ((*p >= 0x80)
                        && (*p <= 0x8b)) {
                        *pExtChar = 0xce;
                        (*p) += 0x20;
                    }
                    else if (*p == 0x8c) {
                        *pExtChar = 0xce;
                        (*p) = 0x8c;
                    }
                    else if (*p == 0x8d) {
                        *pExtChar = 0xce;
                        (*p) = 0x8e;
                    }
                    else if (*p == 0x8e) {
                        *pExtChar = 0xce;
                        (*p) = 0x8f;
                    }
                    else if (*p == 0x91)
                        (*p) = 0xb4;
                    else if (*p == 0x97)
                        (*p) = 0x8f;
                    else if ((*p >= 0x98)
                        && (*p <= 0xaf)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0xb2)
                        (*p) = 0xb9;
                    else if (*p == 0xb3) {
                        *pExtChar = 0xcd;
                        (*p) = 0xbf;
                    }
                    else if (*p == 0xb8)
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0xbb)
                        (*p)--; /* Prev char is upr */
                    break;
                case 0xd0: /* Cyrillic */
                    if ((*p >= 0xb0)
                        && (*p <= 0xbf))
                        (*p) -= 0x20; /* US ASCII shift */
                    break;
                case 0xd1: /* Cyrillic supplement */
                    if ((*p >= 0x80)
                        && (*p <= 0x8f)) {
                        *pExtChar = 0xd0;
                        (*p) += 0x20;
                    }
                    else if ((*p >= 0x90)
                        && (*p <= 0x9f)) {
                        *pExtChar = 0xd0;
                        (*p) -= 0x10;
                    }
                    else if ((*p >= 0xa0)
                        && (*p <= 0xbf)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    break;
                case 0xd2: /* Cyrillic supplement */
                    if (*p == 0x81)
                        (*p)--; /* Prev char is upr */
                    else if ((*p >= 0x8a)
                        && (*p <= 0xbf)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    break;
                case 0xd3: /* Cyrillic supplement */
                    if ((*p >= 0x81)
                        && (*p <= 0x8e)
                        && (!(*p % 2))) /* Even */
                        (*p)--; /* Prev char is upr */
                    else if (*p == 0x8f)
                        (*p) = 0x80;
                    else if ((*p >= 0x90)
                        && (*p <= 0xbf)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    break;
                case 0xd4: /* Cyrillic supplement & Armenian */
                    if ((*p >= 0x80)
                        && (*p <= 0xaf)
                        && (*p % 2)) /* Odd */
                        (*p)--; /* Prev char is upr */
                    break;
                case 0xd5: /* Armenian */
                    if ((*p >= 0xa1)
                        && (*p <= 0xaf)) {
                        *pExtChar = 0xd4;
                        (*p) += 0x10;
                    }
                    else if ((*p >= 0xb0)
                        && (*p <= 0xbf)) {
                        (*p) -= 0x30;
                    }
                    break;
                case 0xd6: /* Armenian */
                    if ((*p >= 0x80)
                        && (*p <= 0x86)) {
                        *pExtChar = 0xd5;
                        (*p) += 0x10;
                    }
                    break;
                case 0xe1: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x82: /* Georgian Asomtavruli  */
                        if ((*p >= 0xa0)
                            && (*p <= 0xbf)) {
                            *pExtChar = 0xb2;
                            (*p) -= 0x10;
                        }
                        break;
                    case 0x83: /* Georgian */
                        /* Georgian Asomtavruli  */
                        if (((*p >= 0x80)
                            && (*p <= 0x85))
                            || (*p == 0x87)
                            || (*p == 0x8d)) {
                            *pExtChar = 0xb2;
                            (*p) += 0x30;
                        }
                        /* Georgian mkhedruli */
                        else if (((*p >= 0x90)
                            && (*p <= 0xba))
                            || (*p == 0xbd)
                            || (*p == 0xbe)
                            || (*p == 0xbf)) {
                            *pExtChar = 0xb2;
                        }
                        break;
                    case 0x8f: /* Cherokee */
                        if ((*p >= 0xb8)
                            && (*p <= 0xbd)) {
                            (*p) -= 0x08;
                        }
                        break;
                    case 0xb5: /* Latin ext */
                        if (*p == 0xb9) {
                            *(p - 2) = 0xea;
                            *(p - 1) = 0x9d;
                            (*p) = 0xbd;
                        }
                        else if (*p == 0xbd) {
                            *(p - 2) = 0xe2;
                            *(p - 1) = 0xb1;
                            (*p) = 0xa3;
                        }
                        break;
                    case 0xb6: /* Latin ext */
                        if (*p == 0x8e) {
                            *(p - 2) = 0xea;
                            *(p - 1) = 0x9f;
                            (*p) = 0x86;
                        }
                        break;
                    case 0xb8: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0xb9: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0xba: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0x95)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        else if ((*p >= 0xa0)
                            && (*p <= 0xbf)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0xbb: /* Latin ext */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0xbc: /* Greek ext */
                        if ((*p >= 0x80)
                            && (*p <= 0x87))
                            (*p) += 0x08;
                        else if ((*p >= 0x90)
                            && (*p <= 0x95))
                            (*p) += 0x08;
                        else if ((*p >= 0xa0)
                            && (*p <= 0xa7))
                            (*p) += 0x08;
                        else if ((*p >= 0xb0)
                            && (*p <= 0xb7))
                            (*p) += 0x08;
                        break;
                    case 0xbd: /* Greek ext */
                        if ((*p >= 0x80)
                            && (*p <= 0x85))
                            (*p) += 0x08;
                        else if ((*p == 0x91)
                            || (*p == 0x93)
                            || (*p == 0x95)
                            || (*p == 0x97))
                            (*p) += 0x08;
                        else if ((*p >= 0xa0)
                            && (*p <= 0xa7))
                            (*p) += 0x08;
                        else if ((*p >= 0xb0)
                            && (*p <= 0xb1)) {
                            *(p - 1) = 0xbe;
                            (*p) += 0x0a;
                        }
                        else if ((*p >= 0xb2)
                            && (*p <= 0xb5)) {
                            *(p - 1) = 0xbf;
                            (*p) -= 0x2a;
                        }
                        else if ((*p >= 0xb6)
                            && (*p <= 0xb7)) {
                            *(p - 1) = 0xbf;
                            (*p) -= 0x1c;
                        }
                        else if ((*p >= 0xb8)
                            && (*p <= 0xb9)) {
                            *(p - 1) = 0xbf;
                        }
                        else if ((*p >= 0xba)
                            && (*p <= 0xbb)) {
                            *(p - 1) = 0xbf;
                            (*p) -= 0x10;
                        }
                        else if ((*p >= 0xbc)
                            && (*p <= 0xbd)) {
                            *(p - 1) = 0xbf;
                            (*p) -= 0x02;
                        }
                        break;
                    case 0xbe: /* Greek ext */
                        if ((*p >= 0x80)
                            && (*p <= 0x87))
                            (*p) += 0x08;
                        else if ((*p >= 0x90)
                            && (*p <= 0x97))
                            (*p) += 0x08;
                        else if ((*p >= 0xa0)
                            && (*p <= 0xa7))
                            (*p) += 0x08;
                        else if ((*p >= 0xb0)
                            && (*p <= 0xb1))
                            (*p) += 0x08;
                        else if (*p == 0xb3)
                            (*p) += 0x09;
                        break;
                    case 0xbf: /* Greek ext */
                        if (*p == 0x83)
                            (*p) += 0x09;
                        else if ((*p >= 0x90)
                            && (*p <= 0x91))
                            (*p) += 0x08;
                        else if ((*p >= 0xa0)
                            && (*p <= 0xa1))
                            (*p) += 0x08;
                        else if (*p == 0xa5)
                            (*p) += 0x07;
                        else if (*p == 0xb3)
                            (*p) += 0x09;
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xe2: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0xb0: /* Glagolitic  */
                        if ((*p >= 0xb0)
                            && (*p <= 0xbf)) {
                            (*p) -= 0x30;
                        }
                        break;
                    case 0xb1: /* Glagolitic */
                        if ((*p >= 0x80)
                            && (*p <= 0x9e)) {
                            *pExtChar = 0xb0;
                            (*p) += 0x10;
                        }
                        else { /* Latin ext */
                            switch (*p) {
                            case 0xa1:
                            case 0xa8:
                            case 0xaa:
                            case 0xac:
                            case 0xb3:
                            case 0xb6:
                                (*p)--; /* Prev char is upr */
                                break;
                            case 0xa5: /* Capital is two bytes (0xc8 0xba); cannot be shortened in place */
                            case 0xa6: /* Capital is two bytes (0xc8 0xbe); cannot be shortened in place */
                                break;
                            default:
                                break;
                            }
                        }
                        break;
                    case 0xb2: /* Coptic */
                        if ((*p >= 0x80)
                            && (*p <= 0xbf)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0xb3: /* Coptic */
                        if (((*p >= 0x80)
                            && (*p <= 0xa3)
                            && (*p % 2)) /* Odd */
                            || (*p == 0xac)
                            || (*p == 0xae)
                            || (*p == 0xb3))
                            (*p)--; /* Prev char is upr */
                        break;
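                    case 0xb4: /* Georgian */
                        if ((*p >= 0x80)
                            && (*p <= 0x9f)) {
                            /* U+2D00..U+2D1F -> U+10A0..U+10BF */
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0x82;
                            (*p) += 0x20;
                        }
                        else if (((*p >= 0xa0)
                            && (*p <= 0xa5))
                            || (*p == 0xa7)
                            || (*p == 0xad)) {
                            /* U+2D20..U+2D25, U+2D27, U+2D2D -> U+10C0.. */
                            *(p - 2) = 0xe1;
                            *(p - 1) = 0x83;
                            (*p) -= 0x20;
                        }
                        break;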
                    default:
                        break;
                    }
                    break;
                case 0xea: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x99: /* Cyrillic */
                        if ((*p >= 0x80)
                            && (*p <= 0xad)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0x9a: /* Cyrillic */
                        if ((*p >= 0x80)
                            && (*p <= 0x9b)
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0x9c: /* Latin ext */
                        if ((((*p >= 0xa2)
                            && (*p <= 0xaf))
                            || ((*p >= 0xb2)
                                && (*p <= 0xbf)))
                            && (*p % 2)) /* Odd */
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0x9d: /* Latin ext */
                        if (((*p >= 0x80)
                            && (*p <= 0xaf)
                            && (*p % 2)) /* Odd */
                            || (*p == 0xba)
                            || (*p == 0xbc)
                            || (*p == 0xbf))
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0x9e: /* Latin ext */
                        if (((((*p >= 0x80)
                            && (*p <= 0x87))
                            || ((*p >= 0x96)
                                && (*p <= 0xa9))
                            || ((*p >= 0xb4)
                                && (*p <= 0xbf)))
                            && (*p % 2)) /* Odd */
                            || (*p == 0x8c)
                            || (*p == 0x91)
                            || (*p == 0x93))
                            (*p)--; /* Prev char is upr */
                        else if (*p == 0x94) {
                            *(p - 2) = 0xea;
                            *(p - 1) = 0x9f;
                            *(p) = 0x84;
                        }
                        break;
                    case 0x9f: /* Latin ext */
                        if ((*p == 0x83)
                            || (*p == 0x88)
                            || (*p == 0x8a)
                            || (*p == 0xb6))
                            (*p)--; /* Prev char is upr */
                        break;
                    case 0xad:
                        /* Latin ext */
                        if (*p == 0x93) {
                            *pExtChar = 0x9e;
                            (*p) = 0xb3;
                        }
                        /* Cherokee */
                        else if ((*p >= 0xb0)
                            && (*p <= 0xbf)) {
                            *(p - 2) = 0xe1;
                            *pExtChar = 0x8e;
                            (*p) -= 0x10;
                        }
                        break;
                    case 0xae: /* Cherokee */
                        if ((*p >= 0x80)
                            && (*p <= 0x8f)) {
                            *(p - 2) = 0xe1;
                            *pExtChar = 0x8e;
                            (*p) += 0x30;
                        }
                        else if ((*p >= 0x90)
                            && (*p <= 0xbf)) {
                            *(p - 2) = 0xe1;
                            *pExtChar = 0x8f;
                            (*p) -= 0x10;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xef: /* Three byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0xbd: /* Latin fullwidth */
                        if ((*p >= 0x81)
                            && (*p <= 0x9a)) {
                            *pExtChar = 0xbc;
                            (*p) += 0x20;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                case 0xf0: /* Four byte code */
                    pExtChar = p;
                    p++;
                    switch (*pExtChar) {
                    case 0x90:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0x90: /* Deseret */
                            if ((*p >= 0xa8)
                                && (*p <= 0xbf)) {
                                (*p) -= 0x28;
                            }
                            break;
                        case 0x91: /* Deseret */
                            if ((*p >= 0x80)
                                && (*p <= 0x8f)) {
                                *pExtChar = 0x90;
                                (*p) += 0x18;
                            }
                            break;
                        case 0x93: /* Osage  */
                            if ((*p >= 0x98)
                                && (*p <= 0xa7)) {
                                *pExtChar = 0x92;
                                (*p) += 0x18;
                            }
                            else if ((*p >= 0xa8)
                                && (*p <= 0xbb))
                                (*p) -= 0x28;
                            break;
                        case 0xb3: /* Old hungarian */
                            if ((*p >= 0x80)
                                && (*p <= 0xb2))
                                *pExtChar = 0xb2;
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x91:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xa3: /* Warang citi */
                            if ((*p >= 0x80)
                                && (*p <= 0x9f)) {
                                *pExtChar = 0xa2;
                                (*p) += 0x20;
                            }
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x96:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xb9: /* Medefaidrin */
                            if ((*p >= 0xa0)
                                && (*p <= 0xbf))
                                (*p) -= 0x20;
                            break;
                        default:
                            break;
                        }
                        break;
                    case 0x9e:
                        pExtChar = p;
                        p++;
                        switch (*pExtChar) {
                        case 0xa4: /* Adlam */
                            if ((*p >= 0xa2)
                                && (*p <= 0xbf))
                                (*p) -= 0x22;
                            break;
                        case 0xa5: /* Adlam */
                            if ((*p >= 0x80)
                                && (*p <= 0x83)) {
                                *pExtChar = 0xa4;
                                (*p) += 0x1e;
                            }
                            break;
                        default:
                            break;
                        }
                        break;
                    default:
                        break;
                    }
                    break;
                default:
                    break;
                }
                pExtChar = 0;
            }
            p++;
        }
    }
    return pString;
}
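
To make the byte arithmetic easier to follow, here is a minimal sketch that isolates just one branch of the function above, the fullwidth Latin case (`0xef 0xbd 0x81..0x9a`). The name `fullwidth_to_upper` is hypothetical and not part of the original code. Like the full function, it assumes valid, NUL-terminated UTF-8 and relies on the upper-case form having the same byte length as the lower-case form, so the string can be rewritten in place.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only: upper-cases fullwidth Latin
   small letters (U+FF41..U+FF5A, bytes ef bd 81..9a) in place and leaves
   every other byte untouched. */
unsigned char *fullwidth_to_upper(unsigned char *s)
{
    unsigned char *p = s;

    if (s == NULL)
        return NULL;

    while (*p) {
        /* Short-circuit evaluation keeps the reads inside the string:
           p[1] is only read when p[0] is non-zero, p[2] only when p[1] is. */
        if (p[0] == 0xef && p[1] == 0xbd && p[2] >= 0x81 && p[2] <= 0x9a) {
            p[1] = 0xbc;  /* second byte: 0xbd -> 0xbc */
            p[2] += 0x20; /* third byte: 0x81..0x9a -> 0xa1..0xba */
            p += 3;       /* skip the whole three-byte sequence */
        } else {
            p++;
        }
    }
    return s;
}
```

Applied to the bytes `ef bd 81 ef bd 9a` (U+FF41, U+FF5A), this yields `ef bc a1 ef bc ba` (U+FF21, U+FF3A). The branches marked "cannot be shortened" or left empty in the full function are exactly the cases where this same-length assumption fails.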
Jan Bergström