
Compiling Boost 1.59.0 using the default settings on OS X uses the iconv library. When using things like boost::locale::to_upper() with UTF-8 characters, iconv causes results like "GRüßEN" for inputs like "grüßEN". As you can see, some characters don't get upper-cased correctly.
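
For what it's worth, "GRüßEN" is exactly what a conversion that only knows about ASCII letters would produce. The following is a minimal sketch of that behaviour, not a description of what Boost or iconv do internally; `asciiOnlyUpper` is a hypothetical helper written only for this illustration.

```cpp
#include <string>

// Hypothetical helper (not part of Boost): upper-cases only the ASCII
// letters a-z and copies every other byte through unchanged, so the bytes
// of multi-byte UTF-8 sequences such as ü and ß are left as they are.
std::string asciiOnlyUpper(const std::string& utf8)
{
    std::string out = utf8;
    for (char& c : out) {
        if (c >= 'a' && c <= 'z') {
            c = static_cast<char>(c - 'a' + 'A');
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of "grüßEN" gives back the bytes of "GRüßEN", which matches the symptom above and suggests the default build is not applying any Unicode-aware case mapping.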

I read the fix is to use ICU instead of iconv and so I set off to build Boost with ICU. The method I follow, for my use case, is the following:

  1. Download the unix tar ball (not the ZIP, that has CR/LF line endings and will not work). Un-tar it.
  2. Modify the code in file boost/libs/filesystem/src/operations.cpp at line 1414 to read # if 0 so that the fallback code is always executed. Otherwise I get a linking error telling me that fchmodat is not available in OS X 10.9.
  3. Download ICU 56.1 at http://site.icu-project.org/download/56#TOC-ICU4C-Download. Un-tar it.
  4. cd to `icu/source`.
  5. Run ./configure --enable-static --disable-shared CXXFLAGS="-std=c++14" --prefix="<path to install ICU>"
  6. Run gnumake && gnumake install
  7. cd to boost_1_59_0/.
  8. Run ./bootstrap.sh toolset=darwin macosx-version=10.11 macosx-version-min=10.8 --with-icu=<path where icu was installed>
  9. Run ./b2 toolset=darwin --without-mpi optimization=speed cxxflags="-arch x86_64 -fvisibility=hidden -fvisibility-inlines-hidden -std=c++14 -stdlib=libc++ -ftemplate-depth=512" linkflags="-stdlib=libc++" --reconfigure boost.locale.iconv=off boost.locale.icu=on -sICU_PATH=<path to my icu install dir> -link=static stage.

Now this correctly compiles a version of the Boost libraries, but when using this version, boost::locale::to_upper() completely skips UTF-8 characters and returns "GREN" for inputs like "grüßEN".

Test code looks like this:

static boolean defaultLocaleWasInitialized = false;
...
void String::p_initDefaultLocale(void)
{
    boost::locale::generator gen;
    std::locale defaultLocale = gen("");
    std::locale::global(defaultLocale);
    std::wcout.imbue(defaultLocale);
}
...
String::Pointer String::uppperCaseString(void) const
{
    if (!defaultLocaleWasInitialized) {
        String::p_initDefaultLocale();
        defaultLocaleWasInitialized = true;
    }
    auto result = boost::locale::to_upper(*this);
    auto newString = String::stringWith(result.c_str());
    return newString;
}
...
TEST(Base_String, UpperCaseString_StringWithLowerCaseCharacters_ReturnsOneWithUpperCaseCharacters)
{
    auto test = String::stringWith("Mp3 grüßEN");
    auto result = test->uppperCaseString();
    ASSERT_STREQ("MP3 GRÜSSEN", result->toUTF8());
}
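
One thing that may be worth ruling out, independently of how Boost and ICU were built: `gen("")` asks Boost.Locale for the system default locale, and on POSIX systems that default is normally derived from environment variables such as LC_ALL, LC_CTYPE and LANG. Whether those are set to a UTF-8 locale when the tests run is an assumption I have not verified. The sketch below uses only the standard library; both function names are hypothetical, and the lookup order is the usual POSIX precedence rather than anything taken from Boost's source.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <initializer_list>
#include <string>

// Hypothetical helper: true if a POSIX-style locale name such as
// "en_US.UTF-8" mentions a UTF-8 encoding (case-insensitive).
bool looksLikeUtf8Locale(const std::string& name)
{
    std::string lower(name.size(), '\0');
    std::transform(name.begin(), name.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return lower.find("utf-8") != std::string::npos
        || lower.find("utf8") != std::string::npos;
}

// Hypothetical helper: returns the first non-empty value among LC_ALL,
// LC_CTYPE and LANG, or an empty string if none of them is set.
// Assumption: this is the conventional POSIX precedence, not necessarily
// the order Boost.Locale itself consults.
std::string localeFromEnvironment()
{
    for (const char* var : {"LC_ALL", "LC_CTYPE", "LANG"}) {
        const char* value = std::getenv(var);
        if (value != nullptr && *value != '\0') {
            return value;
        }
    }
    return "";
}
```

If `localeFromEnvironment()` comes back empty, or `looksLikeUtf8Locale()` returns false for it, in the process that runs the tests, then replacing `gen("")` with an explicit name such as `gen("en_US.UTF-8")` would be a cheap experiment.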

Any suggestions as to where I'm going wrong?

Didier Malenfant
  • `iconv` converts strings between different encodings—it won't do case conversion. You should include code for a small test program showing the problem. – roeland Nov 10 '15 at 22:25
  • Added the code to my question. Trying to find where I read that ICU was required for proper conversion. Does the string convert correctly using your boost libraries? – Didier Malenfant Nov 11 '15 at 04:45
  • Having non-ascii characters in a string literal, like in `"Mp3 grüßEN"` is undefined behaviour. You have to ensure in some other way your string contains the characters you expect it to contain, eg. by UTF-8 encoding that string and coding the resulting bytes like this: **ü** → `"\xc3\xbc"`. And any library you use has to somehow know what encoding you used. – roeland Nov 11 '15 at 04:59
  • Recompiled ICU using ```-DU_CHARSET_IS_UTF8=1```. I get the same result (skipped character) when using the string literal ```"GR \xC3\xBC en"``` as a test. – Didier Malenfant Nov 11 '15 at 06:00
  • I'm not sure the input is the issue. If I copy the code taken from http://stackoverflow.com/questions/22331487/string-conversion-with-boost-locale-different-behaviour-on-windows-and-linux I get ```grüßen vs GREN gren gren``` – Didier Malenfant Nov 11 '15 at 06:15
  • Can you verify whether it's merely skipping the character or interpreting it as a control code, which isn't displayed? – Jonathan Howard Nov 12 '15 at 19:41
  • Good suggestion. How would you go about doing this, since I can't trace thru the compiled boost/ICU libs? – Didier Malenfant Nov 12 '15 at 20:22
  • On Linux your code runs completely fine and gives the correct result: MP3 GRÜSSEN. On my system Boost is compiled this way: https://gitweb.gentoo.org/repo/gentoo.git/tree/dev-libs/boost/boost-1.56.0-r1.ebuild – fghj Nov 14 '15 at 16:19
  • It seems like this only compiles boost and uses ICU as shared library if present. It could be that your copy of ICU is correctly compiled and my static version isn't. It's hard for me to replicate this locally though since I'm trying to compile both from scratch. – Didier Malenfant Nov 14 '15 at 19:25
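
On the question of whether the character is skipped or replaced by something that isn't displayed: this can be checked without tracing into Boost or ICU by dumping the bytes of the string that `to_upper()` returns. A minimal sketch, assuming the result is available as a `std::string`; `hexDump` is a hypothetical helper name.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of s as two lowercase hex digits,
// separated by single spaces, so invisible or unexpected bytes show up.
std::string hexDump(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For the test literal `"GR \xC3\xBC en"`, the input dumps as `47 52 20 c3 bc 20 65 6e`. A correct upper-casing would dump as `47 52 20 c3 9c 20 45 4e` (Ü is `c3 9c` in UTF-8), whereas a genuinely skipped character would leave `47 52 20 20 45 4e`. Any other bytes in that position would point to a substitution rather than a skip.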

1 Answer


AFAIK, this is the correct Boost behavior. Boost has limitations in its localization support; for example, it considers ß a single code point and cannot uppercase it to SS. Hence your code is not going wrong, it is simply a problem with the Boost library: the exact behavior when it comes to some UTF-8 characters is platform dependent.

Boost ß limitations: "boost to_upper function of string_algo doesn't take into account the locale" and "boost::algorithm::to_upper/to_lower ok for utf8? boost::locale not necessary?" (especially the comments).

Platform dependence: "string conversion with boost locale: different behaviour on windows and linux".

If you want it to be translated properly, you will probably need another library. Localization is hard!
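
To make the ß point concrete: upper-casing ß to SS turns one code point into two, so a strictly one-character-in, one-character-out mapping cannot express it. The toy sketch below shows the kind of string-level mapping that is needed. `toyFullUpper` is a hypothetical helper with exactly two hard-coded UTF-8 sequences (ü and ß); it is an illustration only, not a substitute for a library with real Unicode case-mapping data.

```cpp
#include <cstddef>
#include <string>

// Toy illustration only: upper-cases ASCII a-z and two hard-coded UTF-8
// sequences. Every other byte is copied through unchanged.
std::string toyFullUpper(const std::string& utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        const unsigned char next = (i + 1 < utf8.size())
            ? static_cast<unsigned char>(utf8[i + 1])
            : 0;
        if (c == 0xC3 && next == 0xBC) {
            out += "\xC3\x9C";  // ü (U+00FC) becomes Ü (U+00DC)
            ++i;                // consume the second byte of the sequence
        } else if (c == 0xC3 && next == 0x9F) {
            out += "SS";        // ß (U+00DF) expands to two code points
            ++i;
        } else if (c >= 'a' && c <= 'z') {
            out += static_cast<char>(c - 'a' + 'A');
        } else {
            out += utf8[i];
        }
    }
    return out;
}
```

With the UTF-8 bytes of "Mp3 grüßEN" as input, this returns the bytes of "MP3 GRÜSSEN", the result the test in the question expects.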

personjerry
  • You may be correct regarding boost::locale without ICU, but ICU's behavior is not platform dependent, and my understanding is that Boost's behavior with ICU is not platform dependent either. boost::locale's own documentation mentions that upper-casing ```ß``` is supported: http://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/conversions.html Regardless, it's safe to say that ignoring characters (as I am seeing) is not correct behavior, so my question remains: where am I going wrong in compiling these libraries from scratch? – Didier Malenfant Nov 19 '15 at 18:50