I've got a string I want to capitalize, but it might contain Polish special letters (ą, ć, ę, ł, ń, ó, ś, ż, ź). The call transform(string.begin(), string.end(), string.begin(), ::toupper); only capitalizes the Latin alphabet, so I wrote a function like this:


    string to_upper(string nazwa)
    {
        transform(nazwa.begin(), nazwa.end(), nazwa.begin(), ::toupper);

        for (int i = 0; i < (int)nazwa.size(); i++)
        {
            switch(nazwa[i])
            {
                case u'ą':
                {
                    nazwa[i] = u'Ą';
                    break;
                }
                case u'ć':
                {
                    nazwa[i] = u'Ć';
                    break;
                }
                case u'ę':
                {
                    nazwa[i] = u'Ę';
                    break;
                }
                case u'ó':
                {
                    nazwa[i] = u'Ó';
                    break;
                }
                case u'ł':
                {
                    nazwa[i] = u'Ł';
                    break;
                }
                case u'ń':
                {
                    nazwa[i] = u'Ń';
                    break;
                }
                case u'ś':
                {
                    nazwa[i] = u'Ś';
                    break;
                }
                case u'ż':
                {
                    nazwa[i] = u'Ż';
                    break;
                }
                case u'ź':
                {
                    nazwa[i] = u'Ź';
                    break;
                }
            }
        }

        return nazwa;
    }

I also tried using if instead of switch, but it doesn't change anything. In Qt Creator, every capital letter to be inserted, apart from u'Ó', gets a similar warning: Implicit conversion from 'char16_t' to 'std::basic_string<char>::value_type' (aka 'char') changes value from 260 to 4 (this one is for u'Ą'). After running the program, the chars in the string aren't swapped.

Greg
  • You will need to make use of unicode code points. As these letters are not represented in the ASCII table. Take a look at https://stackoverflow.com/questions/331690 and https://stackoverflow.com/questions/3010739 – binaryescape Aug 02 '23 at 09:43
  • @Yunnosch OP mentioned it in the question. This would only work if you set locale that supports Polish characters. – Yksisarvinen Aug 02 '23 at 09:56
  • I've got this and it still doesn't work: ` setlocale(LC_ALL, "pl_PL"); unsigned char c = nazwa[i]; nazwa[i] = toupper(c);` – Greg Aug 02 '23 at 10:04
  • @Greg - You need to use either the (templated) `std::toupper()` which accepts a `const std::locale &` as the second argument or `std::towupper()`. Either way, you will need to pass a wide character (e.g. `wchar_t` not a `char` type). Bear in mind that both only map one character to one character (e.g. they cannot be used if the conversion to uppercase maps one character to a pair of characters). – Peter Aug 02 '23 at 10:33
  • You'll need to start by deciding which encoding the `string` is encoded with. `std::string` does not imply any particular encoding. – user17732522 Aug 02 '23 at 10:34
  • @Yksisarvinen Thanks, missed it for lack of `()`. My mistake. And I was not aware of the locale dependency. (Happy to have asked, instead of spouting wrong non-solutions.... ;-) ) – Yunnosch Aug 02 '23 at 10:46
  • @Peter - I've got this and none of the two last lines work (I'm not using both at the same time), I still get a small letter: setlocale(LC_ALL, "pl_PL"); wchar_t c = nazwa[i]; nazwa[i] = toupper(c, locale("pl_PL")); nazwa[i] = towupper(c); – Greg Aug 02 '23 at 10:52
  • This might be helpful, not to solve the encoding issue but to optimize your code, look at the following: https://old.unicode-table.com/en/blocks/latin-extended-a/ In the Unicode table, the value of each capital letter is just the value of the lowercase one - 1. So just check if the character code is between 0x100 and 0x17E, and if it is an odd number, subtract 1 from it to make it a capital letter. Though it might be hard to put in practice because it's likely that your ``std::string`` here is encoded in [UTF-8](https://en.wikipedia.org/wiki/UTF-8), so one char is not equal to one letter. – RedStoneMatt Aug 02 '23 at 11:21
  • What encoding are you using? 8859-2? Unicode UTF-8 (65001)? Windows-1250? Mac-10029? – Eljay Aug 02 '23 at 11:22
  • @RedStoneMatt - I've already tried that, it doesn't work, for 'ą' it gives me a capital A umlaut. – Greg Aug 02 '23 at 11:28
  • @Eljay - I think I'm using UTF-8 – Greg Aug 02 '23 at 11:28
  • Another option is to use the [ICU](https://unicode-org.github.io/icu/userguide/icu/howtouseicu.html) library. – Eljay Aug 02 '23 at 12:16
  • What is your OS and compiler? And where do you get your string? (Terminal, disk file, network connection, something else?) – n. m. could be an AI Aug 02 '23 at 12:28
  • More importantly, *why* do you need to capitalise the string? Most people who think they need to capitalise strings in reality only need case-folding collation, which is a much easier problem (capitalisation in general is surprisingly hard if you need to deal with many languages). – n. m. could be an AI Aug 02 '23 at 12:43
  • @Greg "I've already tried that, it doesn't work, for 'ą' it gives me a capital A umlaut" likely because you subtracted 1 to both bytes that composed the ``ą``. Please see my full answer below, it explains how UTF-8 works with your specific case and gives a function that should do the job for you. – RedStoneMatt Aug 02 '23 at 12:49

3 Answers

2

The source of your issue

std::string stores characters as chars, which are one byte long, so each one can only hold 256 distinct values (0 to 255 when read as unsigned).

This makes it impossible to store u'ą' in one char for example, as the unicode value for ą is 0x105 (= 261 in decimal, which is higher than 255).

To avoid this problem, humans have invented UTF-8, a character encoding standard that lets you encode any Unicode character as a sequence of bytes. Characters with higher values take multiple bytes to encode.

It is very likely that your std::string has its characters encoded in UTF-8. (I say very likely because your code doesn't directly indicate it, but UTF-8 is by far the most common way to encode accented letters in char-based strings. To be sure, you'd need to check how the string is produced, for example in Qt's code, since that seems to be what you are using.)

The result of this is that you can't just use a for to iterate through the chars of your std::string the way that you are because you basically assume that one char equals one letter, which is simply not the case.

In the case of ą for example, it'll be encoded as bytes C4 85, so you will have one char that will have the value 0xC4 (= 196) followed by another char of value 0x85 (= 133).


The specific case for the characters you want to capitalize

The Latin Extended-A part of the Unicode table (archive) fortunately shows us that these special capital letters come right before their lowercase counterparts.

More than that, we can see that:

  • From Unicode index 0x100 to 0x137 (both included), lowercase letters are the odd indices.
  • From 0x139 to 0x148 (both included), lowercases are the even indices.
  • From 0x14A to 0x177 (both included), lowercases are the odd indices.
  • From 0x179 to 0x17E (both included), lowercases are the even indices.

This will make it easier to convert lowercase code points to uppercase ones, since all we have to do is check if the index of a character corresponds to a lowercase one, and if so, subtract one from it to make it uppercase.


Encoding one of those characters in UTF-8

To encode these in UTF-8 (source):

  • Convert the code point (the Unicode value if you prefer to say it like that) to binary
  • The first byte of your UTF-8-encoded character will have the binary value 110xxxxx; replace xxxxx with the upper five bits of the character's binary code point
  • The second byte will have the binary value 10xxxxxx; replace xxxxxx with the lower six bits of the character's binary code point

So for ą, value is 0x105 in hex, so 00100000101 in binary.

First byte value is then 11000100 (= 0xC4).

Second byte value is then 10000101 (= 0x85).

Note that this encoding 'technique' works because the characters you want to capitalize have their value (code point) between 0x80 and 0x7FF. The scheme changes depending on how high the value is, see documentation here.


Fixing your code

I have rewritten your to_upper function according to what I have written so far:

string to_upper(string nazwa)
{
    for (int i = 0; i < (int)nazwa.size(); i++)
    {
        // Getting the current character we are working with
        char chr1 = nazwa[i];

        // We want to find UTF-8-encoded polish letters here
        // So we are looking for a character that has first three bits set to 110,
        // as all polish letters encoded in UTF-8 are in UTF-8 Class 1 and therefore
        // are two bytes long, the first byte being of binary value 110xxxxx
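        if(((chr1 >> 5) & 0b111) != 0b110) {
            // Do the std toupper here for regular (ASCII) characters only: other bytes belong to
            // longer UTF-8 sequences and must stay untouched, and passing a negative char to
            // toupper is undefined behavior
            if((chr1 & 0x80) == 0)
                nazwa[i] = toupper((unsigned char)chr1);
            continue;
        }

        // If we are here, then the character we are dealing with is two bytes long, so get its value.
        // Stop if the string ends in the middle of a character
        if(i + 1 >= (int)nazwa.size())
            break;

        // We won't need to check for that second byte during next iteration, so we increment i
        i++;
        char chr2 = nazwa[i];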

        // Get the unicode value of the encoded character
        uint16_t fullChr = ((chr1 & 0b11111) << 6) | (chr2 & 0b111111);

        // Get the various conditions to check for lowercase code points
        bool lowercaseIsOdd =  (fullChr >= 0x100 && fullChr <= 0x137) || (fullChr >= 0x14A && fullChr <= 0x177);
        bool lowercaseIsEven = (fullChr >= 0x139 && fullChr <= 0x148) || (fullChr >= 0x179 && fullChr <= 0x17E);
        bool chrIndexIsOdd =   (fullChr % 2) == 1;

        // Depending on whether the code point needs to be odd or even to be lowercase, and on whether it
        // actually is odd or even, decrease it by one to make it uppercase
        if((lowercaseIsOdd && chrIndexIsOdd)
        || (lowercaseIsEven && !chrIndexIsOdd))
            fullChr--;

        // Support for some additional, more commonly used accented letters
        if(fullChr >= 0xE0 && fullChr <= 0xF6)
            fullChr -= 0x20;

        // Re-encode the character point in UTF-8
        nazwa[i-1] = (0b110 << 5) | ((fullChr >> 6) & 0b11111); // We incremented i earlier, so subtract one to edit the first byte of the letter we're encoding
        nazwa[i] = (0b10 << 6) | (fullChr & 0b111111);
    }

    return nazwa;
}

Note: don't forget to #include <cstdint> for uint16_t to work.

Note 2: I have added support for some Latin 1 Supplement (archive) letters because you asked for it in comments. Although we subtract 0x20 from lowercase code points to get the uppercase ones, it is pretty much the same principle as for other letters I have covered in this answer.

I have included lots of comments in my code, please consider reading them for a better understanding.

I have tested it with the string "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž" and it converted it to "ĀĀĂĂĄĄĆĆĈĈĊĊČČĎĎĐĐĒĒĔĔĖĖĘĘĚĚĜĜĞĞĠĠĢĢĤĤĦĦĨĨĪĪĬĬĮĮİİIJIJĴĴĶĶĸĹĹĻĻĽĽĿĿŁŁŃŃŅŅŇŇŊŊŌŌŎŎŐŐŒŒŔŔŖŖŘŘŚŚŜŜŞŞŠŠŢŢŤŤŦŦŨŨŪŪŬŬŮŮŰŰŲŲŴŴŶŶŸŹŹŻŻŽŽ", so it works for these letters, with one caveat: ı (U+0131) is mapped to İ, whereas its real uppercase form is the plain I:

int main() {
    string str1 = "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž";
    string str2 = to_upper(str1);

    printf("str1: %s\n", str1.c_str());
    printf("str2: %s\n", str2.c_str());
}

Picture of a CMD printing the results of the above code

Note: Most terminals use UTF-8 by default, Qt labels as well, and so does most modern software, but the Windows CMD does not. If you are testing the above code in a Windows CMD or PowerShell, you need to switch it to UTF-8 using the command chcp 65001, or by adding a Windows API call that changes the console encoding when you execute your code.

Note 2: When you write string literals directly in your code, GCC and Clang encode them in UTF-8 by default (MSVC needs the /utf-8 option for that). This is why my version of the to_upper function works with Polish letters written directly in code without further modifications.

Note 3: I kept it to avoid causing problems with your current code, but you use string instead of std::string, implying that you have a using namespace std; somewhere in your code. In which case, please see Why is "using namespace std;" considered bad practice?


Note about the other answers

Please keep in mind that my answer is very specific to your case. It aims to, as you asked for, capitalize Polish letters.

Other answers rely on std features which are apparently more universal and work with all languages, so I'd invite you to give them a look.

It's always better to rely on existing features rather than reinventing the wheel, but I think it's also good to have a self-made alternative that might be easier to understand and sometimes is more efficient.

RedStoneMatt
  • "This makes it impossible to store u'ą' in one char for example" People were doing that long before the advent of Unicode, and are still doing it now (see Windows code pages etc). Not that it a good way to tackle the problem in 2023, but it is certainly an existing and to some extent working way. – n. m. could be an AI Aug 02 '23 at 12:27
  • @n.m.couldbeanAI Windows Code Pages are not only considered a VERY bad practice to use, they are also (obviously) windows-only. OP said they were using Qt Creator, implying that the app they are developing might need to be universal, for which UTF-8 is therefore a must. But yes, indeed, various ASCII Extensions added support for this kind of character, but it is very limited as these extensions are still limited by the 0-255 value limit for each characters. So they add characters for value 128 to 255, but there will always be unsupported characters. UTF-8 does not suffer from this at all. – RedStoneMatt Aug 02 '23 at 12:32
  • Point of UTF-8 is that it supports every single Unicode characters, which can in theory have a value as big as we want. For now there are only four ranges of characters, but more can be added if needed one day. Besides, Since as I mentionned in my previous comment, OP uses Qt, UTF-8 is probably the *only* way to get these accented characters to work in OP's specific case because they very likely need the string to stay UTF-8-encoded for Qt to understand it. – RedStoneMatt Aug 02 '23 at 12:36
  • "Point of UTF-8 is that it supports every single Unicode characters" Proceeds with a manually hardcoded capitalisation table for a small part of Unicode, and an incorrect one to boot (look at your ıŚśŜŝŞ). – n. m. could be an AI Aug 02 '23 at 13:02
  • OP asked for Polish characters, so I have written my code only for that. It would take an eternity to go through every lowercase character in the Unicode table and convert them to uppercase. You're right about the mistake at the end of the string I included in my answer though! I wasn't careful enough, I'll fix it. Thanks for reporting it, though I'd have preferred it if you had said so in a less aggressive way :) – RedStoneMatt Aug 02 '23 at 13:04
  • "Would take an eternity to go through every lowercase character in the Unicode table and convert them to uppercase" That's why you use tables that are already compiled for you and come preinstalled with your computer. – n. m. could be an AI Aug 02 '23 at 13:08
  • @n.m.couldbeanAI Currently investigating the wrong letters; for some reason, it seems that the value of the second byte of these is wrong when getting them from the string. It is quite weird, gonna take some time to figure out how to fix it. Now besides that, if you know of a better way to answer OP's question, then please write an answer for it, it'll be much better to explain your point than comments. I'm always opened to learn more, so feel free to show us how to use these tables, I didn't even know that was a thing, and this is what answers are for. – RedStoneMatt Aug 02 '23 at 13:18
  • Just don't write your own case conversion. Use the standard one. That's all the fix that is required. – n. m. could be an AI Aug 02 '23 at 13:28
  • Fixed the issue you have reported @n.m.couldbeanAI, the culprit was actually ``std``'s ``toupper``, which turned chars with values ``0x9C`` and ``0x9E`` into ``0x8C`` and ``0x8E`` respectively, therefore breaking my function. I fixed this by having the ``to_upper`` calls done only for regular characters, which also makes the code less complex since it now iterates through the string only once instead of twice. – RedStoneMatt Aug 02 '23 at 13:29
  • "Use the standard one", then please make an answer and show the standard way. I digged through stackoverflow and all I found was ["There are no standard way"](https://stackoverflow.com/a/36898621/9399492) and ["Use some library, uppercase isn't a precise term"](https://stackoverflow.com/a/14095175/9399492). If you know the solution, please give the solution, that is the point of this website. – RedStoneMatt Aug 02 '23 at 13:31
  • I have asked a clarification from the OP and may answer when I get it. Meanwhile, the other answer works better than yours and it uses standard facilities. Try it. – n. m. could be an AI Aug 02 '23 at 13:47
  • @n.m.couldbeanAI I will give a try to the other answer. However, please look at the comments below the other answer, it seems that it won't work for OP. – RedStoneMatt Aug 02 '23 at 13:56
  • It won't work *if* OP needs to read characters from the standard input, which we don't know. Some similar methods will work. I asked for a clarification, but none is coming so far. – n. m. could be an AI Aug 02 '23 at 14:18
  • @RedStoneMatt - Thanks, that mostly solves my problem but, the pair 'ó' and 'Ó' (0xD3, 0xF3) don't work, is there a way to modify your program to swap them or is it a completely different thing? – Greg Aug 02 '23 at 14:44
  • [This](https://godbolt.org/z/6E8E4eY9f) should work on a mac for all languages, except for weird cases like Turkish I. It uses a deprecated C++ facility (codecvt_utf8), if you don't like that, replace with any other utf8 to utf32 and back conversion routines. Also substitute your own locale if desired. – n. m. could be an AI Aug 02 '23 at 14:50
  • @Greg Just before the ``// Re-encode the character point in UTF-8`` comment in my code, add ``if(fullChr >= 0xE0 && fullChr <= 0xF6) fullChr -= 0x20;``, it'll add support for a bunch of [Latin 1 Supplement](https://old.unicode-table.com/en/blocks/latin-1-supplement/) letters, including ``ó``. Those aren't polish-exclusive so I didn't include them in my answer. As I mentionned in one of my comments, having support for all of Unicode's lowercase letters would take an eternity and result in a very big function – RedStoneMatt Aug 02 '23 at 15:10
  • Maybe give a look at @n.m.couldbeanAI's suggestion as well, if it works for you Greg. – RedStoneMatt Aug 02 '23 at 15:13
  • I updated my answer according to your request, Greg. – RedStoneMatt Aug 02 '23 at 15:22
  • Thank you @RedStoneMatt , this resolves all my problems with this. – Greg Aug 02 '23 at 16:30
0

The easiest way to handle this is to use a wide string. The only trap is proper handling of the encoding/locale.

So try this:

#include <algorithm>
#include <iostream>
#include <locale>
#include <string>

int main()
try {
    std::locale cLocale{ "C.UTF-8" };
    std::locale::global(cLocale);

    std::locale sys { "" };
    std::wcin.imbue(sys);
    std::wcout.imbue(sys);

    std::wstring line;
    while (getline(std::wcin, line)) {
        std::transform(line.begin(), line.end(), line.begin(), [&cLocale](auto ch) { return std::toupper(ch, cLocale); });
        std::wcout << line << L'\n';
    }
} catch (const std::exception& e) {
    std::cerr << e.what() << '\n';
}

https://godbolt.org/z/3cKaEeW3z

Now:

  • cLocale defines the locale that the standard library uses inside your program (here, for the case conversion).
  • sys is the system locale, which defines the encoding used on the input and output streams. Note which overload of toupper is used: the one from <locale> that takes a locale as its second argument.

The same code should work with std::string, std::cin and std::cout only if you use a single-byte encoding that covers the Polish language. In that case you should change the locale name passed to cLocale, like this:

#include <algorithm>
#include <iostream>
#include <locale>
#include <string>

int main()
try {
    std::locale cLocale{ ".1250" };
    std::locale::global(cLocale);

    std::locale sys { "" };
    std::cin.imbue(sys);
    std::cout.imbue(sys);

    std::string line;
    while (getline(std::cin, line)) {
        std::transform(line.begin(), line.end(), line.begin(), [&cLocale](auto ch) { return std::toupper(ch, cLocale); });
        std::cout << line << '\n';
    }
} catch (const std::exception& e) {
    std::cerr << e.what() << '\n';
}

Note that this locale name is platform and compiler dependent, and the system also has to be configured to support it. The above works on Windows with MSVC (I've tested that). I can't demo this since there is no online compiler which supports a Polish locale.

If a multibyte encoding is used, then transform will fail, since it processes one char at a time and cannot handle multibyte characters.

Marek R
  • I get this error: collate_byname::collate_byname failed to construct for .1250 – Greg Aug 02 '23 at 11:56
  • You didn't write which OS and compiler you use. I wrote that this depends on the platform and compiler, and I tested it with MSVC on Windows 10 with Polish support configured. The wide string version should just work. – Marek R Aug 02 '23 at 11:57
  • I work on MacOS in Qt Creator – Greg Aug 02 '23 at 11:58
  • Then type in terminal `locale -a` to see possible locale names. – Marek R Aug 02 '23 at 11:58
  • On my MacOS there is `pl_PL.ISO8859-2` which I think is equivalent to Windows code page 1250. – Marek R Aug 02 '23 at 12:00
  • OK, it does not work on MacOS :(. – Marek R Aug 02 '23 at 12:02
  • On my MacOS machine the wide string version fails too, but in a strange way. It duplicates characters and adds some trash at the front; at least the duplicated characters are correct upper case. Locale support for C++ sucks. – Marek R Aug 02 '23 at 12:11
  • [There is a slight problem with your approach](https://godbolt.org/z/e7fKzdcvz). It doesn't work with libc++, so MacOS users tied to the compiler shipped with the OS may want to find a different solution. – n. m. could be an AI Aug 02 '23 at 12:35
  • @MarekR It's the fault of libc++. It doesn't have a working wcin/wcout. – n. m. could be an AI Aug 02 '23 at 14:22
0

This should work on most Unix-y systems, except for weird cases like Turkish I and possibly German ß.

#include <clocale>
#include <locale>
#include <iostream>
#include <string>
#include <cwctype>
#include <codecvt>

inline std::wstring stow(const std::string& p)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> wconv;
    return wconv.from_bytes(p);
}

inline std::string wtos(const std::wstring& p)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> wconv;
    return wconv.to_bytes(p);
}


int main()
{
    std::locale loc("");

    // AFAICT the calls below are optional on a Mac 
    // for this particular task but it could be a 
    // good idea to use them anyway
    // std::setlocale(LC_ALL, "");
    // std::locale::global(loc);
    // std::cin.imbue(loc);
    // std::cout.imbue(loc);

    std::string s;
    std::getline(std::cin, s);

    std::wstring w = stow(s);
    for (auto& c: w)
    {
        c = std::toupper(c, loc);
    }

    std::cout << wtos(w) << "\n";
}

Note it uses deprecated C++ facilities for UTF-8 code conversion. If this bothers you, substitute any UTF-8 to UTF-32 and back converters in stow and wtos. Also feel free to substitute a locale that exists on your system (could be "pl_PL.UTF-8" or similar).

n. m. could be an AI