
I'm working on a project in which case-sensitive operations need to be replaced with case-insensitive ones. After doing some reading on this, the types of data to be considered are:

  1. Ascii characters
  2. Non-ascii characters
  3. Unicode characters

Please let me know if I've missed anything in the list.

Do the above need to be handled separately, or are there C++ libraries that can handle them all regardless of the type of data?

Specifically:

  1. Does the Boost library provide support for this? If so, are there examples or documentation on how to use the APIs?

  2. I learned about IBM's International Components for Unicode (ICU). Is this a library that provides support for case-insensitive operations? If so, are there examples or documentation on how to use the APIs?

Finally, which among the aforementioned (and other) approaches is better and why?

Thanks!

Based on the comments and answers, I wrote a sample program to understand this better:

#include <iostream>       // std::cout
#include <string>         // std::string
#include <locale>         // std::locale, std::tolower

using namespace std;

void ascii_to_lower(string& str)
{
     std::locale loc;
     std::cout << "Ascii string: " << str;
     std::cout << "Lower case: ";

     for (std::string::size_type i=0; i<str.length(); ++i)
         std::cout << std::tolower(str[i],loc);
     return;
}

void non_ascii_to_lower(void)
{
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wcout.imbue(std::locale());
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(std::locale());
    std::wstring str = L"Zoë Saldaña played in La maldición del padre Cardona.";

    std::wcout << endl << "Non-Ascii string: " << str << endl;

    f.tolower(&str[0], &str[0] + str.size());

    std::wcout << "Lower case: " << str << endl;

    return;
}

void non_ascii_to_upper(void)
{
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wcout.imbue(std::locale());
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(std::locale());
    std::wstring str = L"¥£ªÄë";

    std::wcout << endl << "Non-Ascii string: " << str << endl;

    f.toupper(&str[0], &str[0] + str.size());

    std::wcout << "Upper case: " << str << endl;

    return;
}

int main ()
{
    string str="Test String.\n";

    ascii_to_lower(str);
    non_ascii_to_upper();
    non_ascii_to_lower();

    return 0;
}

The output is:

Ascii string: Test String. Lower case: test string.

Non-Ascii string: ▒▒▒▒▒ Upper case: ▒▒▒▒▒

Non-Ascii string: Zo▒ Salda▒a played in La maldici▒n del padre Cardona. Lower case: zo▒ salda▒a played in la maldici▒n del padre cardona.

The non-ASCII strings seem to get converted to upper and lower case, but some of the text is not visible in the output. Why is this?

On the whole, does the sample code look OK?

Maddy
  • You can convert ASCII and non-ASCII text to UTF-32 and treat it, along with Unicode strings, as std::wstring. You can then use std::tolower to remove the case factor. – LibertyPaul Mar 24 '16 at 14:18
  • @LibertyPaul Do you mean `std::u32string`? – Simple Mar 24 '16 at 14:19
  • @Simple wstring is basic_string which is 32 bit, but yes, u32string is a good advice. – LibertyPaul Mar 24 '16 at 14:21
  • @LibertyPaul wchar_t is not 32 bits on Windows. – Christophe Mar 24 '16 at 14:21
  • @Christophe As said at cplusplus.com about wchar_t: "Type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales." so it doesnt matter ;) – LibertyPaul Mar 24 '16 at 14:23
  • @LibertyPaul Define "doesn't matter". `wchar_t` on Windows is 16-bit and is used for their UTF-16 API. It can't hold a Unicode codepoint. – Simple Mar 24 '16 at 14:25
  • @LibertyPaul it's more complex than that. Windows works with utf16-LE / ucs2 subset of unicode, but an app could have to read files using larger encodings. – Christophe Mar 24 '16 at 14:27
  • @Simple If it has size of 16 bit than utf32 is not supported on this platform. – LibertyPaul Mar 24 '16 at 14:28
  • @LibertyPaul what? The Windows API is in UTF-16 and `wchar_t` is 16 bits. That doesn't mean you can't use UTF-32 in your program. – Simple Mar 24 '16 at 14:34
  • @LibertyPaul When it was originally created over 15 years ago, wchar_t was able to handle all defined code points, but not anymore. The real issue with this request is that many languages have contextual requirements for case conversion - e.g. capital sigma 0x3A3 in Greek should become either 0x03C3 or 0x03C2, depending on whether it is at the end of a word or not. What is the use-case for this - is it an international application that must support any character set/language, or something else? – Matt Jordan Mar 24 '16 at 14:48
  • The unicode case is covered by [this thread](http://stackoverflow.com/questions/17991431/convert-a-unicode-string-in-c-to-upper-case) – M.M Mar 25 '16 at 05:47
  • @Maddy about your output issue: which compiler do you use ? Which encoding did you use for your source file ? Which os do you use ? And which encoding is defined for your console ? – Christophe Mar 25 '16 at 09:54
  • Compiler: gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC); OS: Scientific Linux release 6.1 (Carbon); console encoding (using 'echo $LANG') is en_US.UTF-8; file encoding (using 'file -bi file.cc') is charset=utf-8 – Maddy Mar 25 '16 at 10:19

2 Answers


I'm a little surprised by this question. A simple search for "boost case conversion" returns, as the first entry, "Usage - 1.41.0 - Boost", which has a section on case conversion.

A search for "stl case conversion" returns "tolower - C++ Reference - Cplusplus.com", which also shows how to convert using the standard library.

To do a case insensitive search, convert both to lower or upper case and compare.
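As a rough illustration of that idea, here is a minimal sketch using only the standard library. The helper name `iequals_simple` is hypothetical (it is not a Boost or standard function), and the sketch assumes single-byte text handled through the given locale, so it does not cover multi-byte encodings such as UTF-8.

```cpp
#include <algorithm>  // std::equal
#include <locale>     // std::locale, std::tolower
#include <string>     // std::string

// Hypothetical helper: compares two strings after lowercasing each
// character with the given locale (the classic "C" locale by default).
// Assumption: one char == one character, which holds for ASCII only.
bool iequals_simple(const std::string& a, const std::string& b,
                    const std::locale& loc = std::locale::classic())
{
    if (a.size() != b.size())
        return false;  // different lengths can never match here

    return std::equal(a.begin(), a.end(), b.begin(),
                      [&loc](char x, char y) {
                          return std::tolower(x, loc) == std::tolower(y, loc);
                      });
}
```

Calling `iequals_simple("HeLlO WoRld!", "hello world!")` should report a match, since both sides lowercase to the same text.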

Example code from boost.org:

string str1("HeLlO WoRld!");
to_upper(str1); // str1=="HELLO WORLD!"

Example from Cplusplus.com:

// tolower example (C++)
#include <iostream>       // std::cout
#include <string>         // std::string
#include <locale>         // std::locale, std::tolower

int main ()
{
  std::locale loc;
  std::string str="Test String.\n";
  for (std::string::size_type i=0; i<str.length(); ++i)
    std::cout << std::tolower(str[i],loc);
  return 0;
}

For ASCII characters (characters with an ASCII value < 128), there should be no problem. If you are using an MBCS (multi-byte character set), you may need to use locales for code pages. Unicode should have no problems AFAIK.

As to Matt Jordan's comment:

The real issue with this request is that many languages have contextual requirements for case conversion - e.g. capital sigma 0x3A3 in Greek should become either 0x03C3 or 0x03C2, depending on whether it is at the end of a word or not.

I would be pleasantly surprised if the Boost library supported this. You would have to test it and report a bug if it doesn't. There's no mention on their page of any contextual case conversions. A workaround might be to compare the strings after converting both to lowercase, and again after converting both to uppercase. If either comparison matches, treat the strings as equal, which should work for 99.99% of cases.
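The workaround could be sketched roughly as follows. The helper names (`to_lower_copy`, `to_upper_copy`, `matches_either_case`) are hypothetical, and the sketch assumes wide strings converted with the `std::ctype<wchar_t>` facet of whichever locale the caller passes in; how many non-ASCII characters are actually converted depends on that locale.

```cpp
#include <locale>  // std::locale, std::use_facet, std::ctype
#include <string>  // std::wstring

// Hypothetical helper: lowercased copy of s according to loc.
std::wstring to_lower_copy(std::wstring s, const std::locale& loc)
{
    if (!s.empty())
        std::use_facet<std::ctype<wchar_t> >(loc).tolower(&s[0], &s[0] + s.size());
    return s;
}

// Hypothetical helper: uppercased copy of s according to loc.
std::wstring to_upper_copy(std::wstring s, const std::locale& loc)
{
    if (!s.empty())
        std::use_facet<std::ctype<wchar_t> >(loc).toupper(&s[0], &s[0] + s.size());
    return s;
}

// Match if the strings are equal after lowercasing OR after uppercasing.
bool matches_either_case(const std::wstring& a, const std::wstring& b,
                         const std::locale& loc = std::locale::classic())
{
    return to_lower_copy(a, loc) == to_lower_copy(b, loc)
        || to_upper_copy(a, loc) == to_upper_copy(b, loc);
}
```

With a Unicode-aware locale, the uppercase comparison is the one that would be expected to catch the Greek case above, because both lowercase sigma forms map to the same capital sigma. That behaviour is an expectation about the locale's tables, not something this sketch guarantees.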

An interesting paper by Bjarne Stroustrup, found here, is a good read regarding Locales.

Adrian

You already have a very good answer about Boost. Here are some additional remarks:

Character encoding

ASCII characters are encoded on 7 bits. ISO 8859-1 and windows-1252 extend ASCII with a limited set of international characters by making use of the 8th bit.

The Unicode standard extends ASCII much further: its code points range up to U+10FFFF, so each one fits in 32 bits. Several encodings are available: UTF-32 is the easiest (1 Unicode code point = 1 code unit of 32 bits), while UTF-16 and UTF-8 store Unicode text with a variable-length encoding built from smaller code units.

To make it even more difficult, different operating systems use different conventions. On Linux, wchar_t is in general a 32-bit wide char used for Unicode, wstring is a string based on wchar_t, and char strings generally use UTF-8. On Windows, wchar_t is defined as 16 bits because Windows' native encoding is UTF-16 (originally UCS-2, a subset of Unicode), and char is generally interpreted in an ANSI code page such as windows-1252.
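To make the difference between these encodings concrete, here is a small sketch that counts code units for the same text. The helper names (`bits_of`, `zoe_utf8` and so on) are hypothetical, the code assumes C++11 or later, and the characters are written as escapes so the result does not depend on the encoding of the source file.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t
#include <string>   // std::string, std::u16string, std::u32string

// Hypothetical helper: width in bits of a character type on this platform.
// bits_of<wchar_t>() is commonly 32 on Linux and 16 on Windows.
template <typename CharT>
constexpr std::size_t bits_of() { return sizeof(CharT) * CHAR_BIT; }

// "Zoë" in three encodings (ë is U+00EB).
inline std::string    zoe_utf8()  { return "Zo\xC3\xAB"; }  // ë takes 2 bytes
inline std::u16string zoe_utf16() { return u"Zo\u00EB"; }   // ë takes 1 unit
inline std::u32string zoe_utf32() { return U"Zo\u00EB"; }   // ë takes 1 unit

// A code point outside the Basic Multilingual Plane (U+1F600):
// UTF-16 needs a surrogate pair, UTF-32 still needs a single unit.
inline std::u16string emoji_utf16() { return u"\U0001F600"; }
inline std::u32string emoji_utf32() { return U"\U0001F600"; }
```

Here `zoe_utf8().size()` is 4 while `zoe_utf32().size()` is 3, and `emoji_utf16().size()` is 2 while `emoji_utf32().size()` is 1. Only the 32-bit form keeps one element per code point.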

Dealing with character size and encoding

So to come back on your problem, there are two aspects to consider:

  • the storage - If you want one size that fits all, you could use char32_t, which can hold ASCII as well as any Unicode code point, and a basic_string<char32_t> (u32string) for strings, which supports all the functions you are used to with normal strings. Or you could use normal strings and adhere to UTF-8 everywhere.

  • the encoding - how your app interprets the values contained in your chars in order to perform operations such as converting to lower or upper case. This is defined by the applicable locale.

Fortunately, the C++ standard library can cope with all these aspects:

  • locale helps manage uppercase and lowercase conversion and testing (e.g. isupper(), isalpha(), ...) using the appropriate encoding
  • codecvt converts between various encodings
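As a small sketch of the locale part, the following uses the locale-aware `std::isupper` overload from `<locale>`. The helper names `make_locale` and `count_upper` are hypothetical, and falling back to the classic "C" locale when a named locale is not installed is an assumption made for this example, not a recommendation from the answer.

```cpp
#include <algorithm>  // std::count_if
#include <cstddef>    // std::size_t
#include <locale>     // std::locale, std::isupper
#include <stdexcept>  // std::runtime_error
#include <string>     // std::wstring

// Hypothetical helper: build a named locale such as "en_US.UTF-8",
// falling back to the classic "C" locale if it is not available.
std::locale make_locale(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Hypothetical helper: count the characters loc classifies as uppercase.
std::size_t count_upper(const std::wstring& s, const std::locale& loc)
{
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(),
                      [&loc](wchar_t c) { return std::isupper(c, loc); }));
}
```

For example, `count_upper(L"Test String.", std::locale::classic())` counts the two capital letters. Passing `make_locale("en_US.UTF-8")` instead lets the same call classify non-ASCII letters, provided that locale exists on the system.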

Additional libraries

The ICU library does provide case-insensitive comparison through case folding (for example UnicodeString::caseCompare() and UnicodeString::foldCase()). It also provides broader support for text processing, for example iterating through text elements, collation ordering and so on.

I'd suggest sticking with the standard library or Boost, given the wide support these enjoy.

Christophe
  • Thanks! I've added a sample code to understand this better. – Maddy Mar 25 '16 at 05:30
  • @Maddy the first part of your locale name corresponds to en = English language and US = USA country settings such as currency. The English language only knows a-z and has no special characters for upper-case conversion. Set Spanish as the language to get proper conversion for Zoë. – Christophe Mar 25 '16 at 09:49
  • And to find the supported locale on your linux system: http://www.cyberciti.biz/faq/how-to-set-locales-i18n-on-a-linux-unix/ – Christophe Mar 25 '16 at 15:23
  • I'm basically looking for a library or API that makes case-insensitive operations seamless by hiding the details of whether the text is ASCII or non-ASCII and which encoding is involved. I should be able to pass the string in hand (say, non-ASCII) to the API, and it should perform the requested operation on the string without the user having to set the locale and so on. Do we have such a thing? – Maddy Mar 29 '16 at 06:06
  • Use case: Suppose my application receives a string initially as "Zoë Saldaña" and it should store it without bothering to know the type of the string. Subsequently, if it receives a string "zoë saLdañA", the comparison between this and the first string should result in a match, merely by calling an API. So, looking for a set of APIs that stores the string in the appropriate format and to later retrieve it and do a case-insensitive comparison. – Maddy Mar 29 '16 at 06:15