5

I have trouble transforming a string to lowercase with the tolower() function in C++. With normal strings it works as expected; however, special characters are not converted successfully.

How I use my function:

string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    LowerCase += tolower(NotLowerCase[i]);
}

For example:

  1. Test -> test
  2. TeST2 -> test2
  3. Grüßen -> gr????en
  4. (§) -> ()

As you can see, examples 3 and 4 do not work as expected.

How can I fix this issue? I have to keep the special chars, but as lowercase.

TVA van Hesteren
  • 3
    Do you realise that this is *impossible* to get right due to the fact that `ß` translates to `SS`, whereas `SS` may be translated to either `ß` or `ss` depending on the context? – Christian Hackl Mar 14 '17 at 17:03
  • Yes, I understand and have already deleted my comment, great answers guys, thx keep it up. p.s. what is a safe language to use when this doesn't occur and just stay with the normal 'word' like it was originally? e.g. en_US.iso88591? – TVA van Hesteren Mar 14 '17 at 17:13
  • 5
    Why do people keep calling perfectly normal letters "special characters"? – n. m. could be an AI Mar 14 '17 at 18:22

3 Answers

7

The sample code below, taken from the documentation for tolower, shows how to fix this: you have to use a locale other than the default "C" locale.

#include <iostream>
#include <cctype>
#include <clocale>

int main()
{
    unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                              // but ´ (acute accent) in ISO-8859-1 

    std::setlocale(LC_ALL, "en_US.iso88591");
    std::cout << std::hex << std::showbase;
    std::cout << "in iso8859-1, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
    std::setlocale(LC_ALL, "en_US.iso885915");
    std::cout << "in iso8859-15, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
}
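Applied to the loop from the question, the same idea could look like the sketch below. It assumes the string is stored in a single-byte encoding such as ISO-8859-1, and `to_lower_bytes` is a hypothetical helper name, not something from the question or the documentation. The cast to `unsigned char` matters because passing a negative `char` value to `std::tolower` is undefined behaviour.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lower-cases every byte of a string using the
// current C locale. Only meaningful for single-byte encodings
// (for example ISO-8859-1), not for UTF-8.
std::string to_lower_bytes(std::string text)
{
    for (char& ch : text) {
        // Convert to unsigned char first: std::tolower expects a value
        // representable as unsigned char (or EOF).
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    return text;
}
```

To affect non-ASCII letters, call `std::setlocale` with a suitable locale name before using the helper. Whether a name such as "en_US.iso88591" is installed depends on the system; `std::setlocale` returns a null pointer when it is not.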

You might also change std::string to std::wstring, which is Unicode on many C++ implementations.

wstring NotLowerCase = L"Grüßen";
wstring LowerCase;
for (auto&& ch : NotLowerCase) {
    LowerCase += towlower(ch);
}

Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.

Keep in mind that a character-by-character transformation might not work well for some languages. For example, in German as written in Germany, making Grüßen all upper-case turns it into GRÜSSEN (although there is now a capital ẞ). There are numerous other "problems" such as combining characters; if you're doing real "production" work with strings, you really want a completely different approach.
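To make the length change concrete, here is a minimal sketch that upper-cases a UTF-32 string and expands ß to SS. The name `to_upper_latin1` is hypothetical, and the mapping deliberately covers only ASCII and the Latin-1 letters; it is an illustration of why a `char`-to-`char` function cannot do this, not a full Unicode case mapping.

```cpp
#include <string>

// Hypothetical helper: upper-cases a UTF-32 string, handling only ASCII
// and Latin-1 letters. Every other code point is copied unchanged.
std::u32string to_upper_latin1(const std::u32string& text)
{
    std::u32string result;
    for (char32_t cp : text) {
        if (cp == 0x00DF) {
            result += U"SS";  // ß expands to two letters
        } else if (cp >= U'a' && cp <= U'z') {
            result += static_cast<char32_t>(cp - 32);
        } else if (cp >= 0x00E0 && cp <= 0x00FE && cp != 0x00F7) {
            // à..þ map to À..Þ; U+00F7 (÷) is not a letter
            result += static_cast<char32_t>(cp - 32);
        } else {
            result += cp;
        }
    }
    return result;
}
```

With this sketch, the six code points of Grüßen become the seven code points of GRÜSSEN, so the result cannot be written back into the original buffer one character at a time.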

Finally, C++ has more sophisticated support for managing locales; see <locale> for details.
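As a rough sketch of what that support looks like, the `std::ctype<wchar_t>` facet of a `std::locale` can lower-case a whole range at once. `to_lower_copy` is a hypothetical helper name; the facet and its range overload of `tolower` are part of the standard library.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lower-cases a wide string using the std::ctype
// facet of the locale passed in.
std::wstring to_lower_copy(std::wstring text, const std::locale& loc)
{
    if (text.empty())
        return text;
    const auto& facet = std::use_facet<std::ctype<wchar_t>>(loc);
    // Range overload: converts the characters in [first, last) in place.
    facet.tolower(&text[0], &text[0] + text.size());
    return text;
}
```

Passing `std::locale::classic()` only changes A to Z. Passing `std::locale("")` uses the user's locale, and the result for non-ASCII letters then depends on the locale data the implementation ships.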

Ðаn
  • Mind you, this works for ISO-8859-*, but NOT for Unicode. And since it's tagged "htmlspecialcharacters", unicode is a fair assumption. – MSalters Mar 14 '17 at 16:35
  • Indeed, I would like to support unicode, since I will have to process many different languages and therefore multiple character sets – TVA van Hesteren Mar 14 '17 at 16:38
  • 1
    Wauw, this toupper -> towupper did it. (I modified it to lower of course, but it seems to work for now) thx for your support! – TVA van Hesteren Mar 14 '17 at 16:47
  • 2
    @Ðаn: Indeed. To be fair, saying `tolower` is already making assumptions about character sets. Chinese is the classical counter-example. ISO-8859 describes a collection of 8 bit character sets, which together cover most of the alphabets for which lowercase makes sense. But for UTF-8, things suddenly are a lot more complex. And don't get me started about locale-specific case rules; I only have 600 characters per comment. One short example to remember, though: ß=>SS. Even in 8859-1, that can't be done with char toupper(char). The length of strings changes with uppercasing! – MSalters Mar 14 '17 at 16:48
  • 2
    @TVAvanHesteren You cannot really support multiple languages unless you support their individual quirks on a case by case basis. You can support *characters* that are used in multiple languages, but only if you don't manipulate these characters in any way. Changing a word to uppercase and then back to lowercase can be [deadly](http://www.theinquirer.net/inquirer/news/1017243/cellphone-localisation-glitch). – n. m. could be an AI Mar 14 '17 at 18:49
  • 1
    So, how do you suggest to fix this or counter the problem? – TVA van Hesteren Mar 14 '17 at 19:47
2

I think the most portable way to do this is to use the user selected locale which is achieved by setting the locale to "" (empty string).

std::locale::global(std::locale("")); 

That sets the locale to whatever was in use where the program was run, and it affects the standard character conversion routines (std::mbsrtowcs & std::wcsrtombs) that convert between multi-byte and wide-string characters.

Then you can use those functions to convert from the system/user selected multi-byte characters (such as UTF-8) to system standard wide character codes that can be used in functions like std::tolower that operate on one character at a time.

This is important because multi-byte character sets like UTF-8 cannot be converted using single-character operations such as std::tolower().
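A small sketch shows the mismatch between bytes and characters. It assumes the input is valid UTF-8, and `utf8_code_points` is a hypothetical helper name. In UTF-8, ü is encoded as the two bytes C3 BC and ß as C3 9F, so a byte-wise loop sees eight units in Grüßen rather than six letters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in a string that is
// assumed to hold valid UTF-8.
std::size_t utf8_code_points(const std::string& text)
{
    std::size_t count = 0;
    for (unsigned char byte : text) {
        // Continuation bytes have the form 10xxxxxx; every other byte
        // starts a new code point.
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Calling a single-byte std::tolower on each of those eight bytes handles the two halves of ü and ß separately, which is why the question's output shows four question marks for two letters.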

Once you have converted the wide string version to upper/lower case it can then be converted back to the system/user multibyte character set for printing to the console.

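#include <cstdlib>   // EXIT_SUCCESS, EXIT_FAILURE
#include <cwchar>    // mbsrtowcs, wcsrtombs, std::mbstate_t
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Convert from multi-byte codes to wide string codes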
std::wstring mb_to_ws(std::string const& mb)
{
    std::wstring ws;
    std::mbstate_t ps{};
    char const* src = mb.data();

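    // With a null destination the length argument is ignored and the
    // call only reports how many wide characters are needed
    std::size_t len = 1 + mbsrtowcs(0, &src, 0, &ps);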

    ws.resize(len);
    src = mb.data();

    mbsrtowcs(&ws[0], &src, ws.size(), &ps);

    if(src)
        throw std::runtime_error("invalid multibyte character after: '"
            + std::string(mb.data(), src) + "'");

    ws.pop_back();

    return ws;
}

// Convert from wide string codes to multi-byte codes
std::string ws_to_mb(std::wstring const& ws)
{
    std::string mb;
    std::mbstate_t ps{};
    wchar_t const* src = ws.data();

    std::size_t len = 1 + wcsrtombs(0, &src, 0, &ps);

    mb.resize(len);
    src = ws.data();

    wcsrtombs(&mb[0], &src, mb.size(), &ps);

    if(src)
        throw std::runtime_error("invalid wide character");

    mb.pop_back();

    return mb;
}

int main()
{
    // set locale to the one chosen by the user
    // (or the one set by the system default)
    std::locale::global(std::locale(""));

    try
    {
        std::string NotLowerCase = "Grüßen";

        std::cout << NotLowerCase << '\n';

        // convert system/user multibyte character codes
        // to wide string versions
        std::wstring ws1 = mb_to_ws(NotLowerCase);
        std::wstring ws2;

        for(unsigned int i = 0; i < ws1.length(); i++) {
            // use the system/user locale
            ws2 += std::tolower(ws1[i], std::locale("")); 
        }

        // convert wide string character codes back
        // to system/user multibyte versions
        std::string LowerCase = ws_to_mb(ws2);

        std::cout << LowerCase << '\n';
    }
    catch(std::exception const& e)
    {
        std::cerr << e.what() << '\n';
        return EXIT_FAILURE;
    }
    catch(...)
    {
        std::cerr << "Unknown exception." << '\n';
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

Code not heavily tested
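One detail worth guarding against: `std::locale("")` throws `std::runtime_error` when the environment names a locale that is not available on the system, and in the program above that call sits outside the try block. A minimal sketch of a fallback, using the hypothetical helper name `user_locale_or_classic`, might be:

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns the user's preferred locale, or the
// classic "C" locale when the environment names a locale that the
// system does not provide.
std::locale user_locale_or_classic()
{
    try {
        return std::locale("");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

Falling back to the classic locale keeps the program running, but only A to Z are then converted, so whether a silent fallback is acceptable is a design choice.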

Galik
-6

use ASCII

string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    if(NotLowerCase[i]<65||NotLowerCase[i]>122)
    {
        LowerCase+='?';
    }
    else
        LowerCase += tolower(NotLowerCase[i]);
}
  • 2
    I need the special characters in lowercase, as stated in the question. This just replaces them with a 'valid' question mark, which is not what was requested. Thanks for your input though – TVA van Hesteren Mar 14 '17 at 16:40