How to make a language-friendly "to lower" function?

Question

I want a single "to lower" function (applied to a word) that works correctly for two languages, for example English and Russian. What should I do? Should I use std::wstring for it, or can I get along with std::string? I also want it to be cross-platform, and I don't want to reinvent the wheel.

This is a complex question. Make sure that you know about locales and that you have read this: http://www.joelonsoftware.com/articles/Unicode.html — Alexandre C., Apr 24 '14 at 19:15
In the end, to get it right, you are forced to go for Unicode strings, in an encoding of your choice (prefer UTF-8). Changing case (lower, upper, title, folded) is not properly defined for single Unicode code points. Still, there are many languages which have conflicting definitions for these transformations. — Deduplicator, Apr 24 '14 at 19:22
So I should use Unicode, and what else? I know exactly which languages I'm going to have: one of two. Couldn't that help somehow? — Ava_Katushka, Apr 24 '14 at 19:28
Standard examples: in Greek, uppercase sigma has two lowercase forms depending on context. Also, the lowercase of I has a dot in French but no dot in Turkish (and the uppercase of i has a dot in Turkish), etc. — Alexandre C., Apr 24 '14 at 19:29
If you want to roll your own, it might help to reduce the tables, and might even allow you to combine them (I didn't test those two). Unless you are prevented from doing so, use ICU as Alexandre posted. — Deduplicator, Apr 24 '14 at 19:29
Uh, it seems a bit difficult. How do I start using ICU in my project? Is there a quick guide for dummies? — Ava_Katushka, Apr 24 '14 at 19:36
Using the Boost interface to ICU (aka Boost.Locale) may be simpler, but you won't escape installing ICU. — Alexandre C., Apr 24 '14 at 19:48
@Ava_Katushka: I believe [this function](http://www.icu-project.org/apiref/icu4c/classicu_1_1UnicodeString.html#afdccf26252579d296828832e25418e32) is relevant to what you want to do. — Alexandre C., Apr 24 '14 at 20:33

score 6 · Accepted Answer · edited May 23 '17 at 12:05

The canonical library for this kind of thing is ICU:

http://site.icu-project.org/

There is also a Boost wrapper (Boost.Locale):

http://www.boost.org/doc/libs/1_55_0/libs/locale/doc/html/index.html

See also this question: Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library

First make sure that you understand the concept of locales, and that you have a firm grasp of what Unicode and, more generally, character encodings are all about.

Some good reads for a quick start:

http://joelonsoftware.com/articles/Unicode.html

http://en.wikipedia.org/wiki/Locale

score 0 · Answer 2 · answered Apr 26 '14 at 13:56

I think this solution is OK. I'm not sure it suits every situation, but it quite possibly does.

#include <locale>
#include <codecvt>
#include <string>

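std::string toLowerCase (const std::string& word) {
    // Convert UTF-8 bytes to wide characters so each letter is one element.
    // Note: std::wstring_convert and std::codecvt_utf8 are deprecated since C++17.
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    // Throws std::runtime_error if this locale is not installed on the system.
    std::locale loc("en_US.UTF-8");
    std::wstring wword = conv.from_bytes(word);
    for (std::size_t i = 0; i < wword.length(); ++i) {
        // Lowercase the wide character, not the raw byte word[i].
        wword[i] = std::tolower(wword[i], loc);
    }
    return conv.to_bytes(wword);
}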
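
For the narrow case in the question (only English and Russian, UTF-8 input), a locale-free alternative in the spirit of the "roll your own with reduced tables" comment might look like the sketch below. The name toLowerEnRu is hypothetical and not from the original post. The sketch assumes valid UTF-8 and deliberately leaves every other script untouched, so it is not a substitute for ICU or Boost.Locale.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): lowercases ASCII A-Z and
// the Russian letters А-Я and Ё in a UTF-8 encoded string.
// Assumes valid UTF-8; all other bytes are copied through unchanged.
std::string toLowerEnRu(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c >= 'A' && c <= 'Z') {
            // ASCII uppercase: shift into the lowercase range.
            out += static_cast<char>(c + ('a' - 'A'));
        } else if (c == 0xD0 && i + 1 < utf8.size()) {
            // Lead byte 0xD0 covers U+0400..U+043F, which holds А..Я and Ё.
            const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            if (next >= 0x90 && next <= 0x9F) {         // А..П -> а..п
                out += static_cast<char>(0xD0);
                out += static_cast<char>(next + 0x20);
            } else if (next >= 0xA0 && next <= 0xAF) {  // Р..Я -> р..я
                out += static_cast<char>(0xD1);
                out += static_cast<char>(next - 0x20);
            } else if (next == 0x81) {                  // Ё -> ё
                out += static_cast<char>(0xD1);
                out += static_cast<char>(0x91);
            } else {
                // Already lowercase or outside the handled range: copy as is.
                out += static_cast<char>(c);
                out += static_cast<char>(next);
            }
            ++i;  // The second byte of the pair has been consumed.
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

With a UTF-8 encoded source file, toLowerEnRu("Привет, World") should return "привет, world". Because the function works on bytes directly, it does not depend on which locales are installed, but it needs a new branch for every additional alphabet.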
