
I have a simple program that tests converting between wchar_t and char using a series of locales read from standard input, one per line. It outputs a list of the conversions that fail, printing the locale name and the string that failed to convert.

I'm building it using clang and libc++. My understanding is that libc++'s named locale support is provided by the xlocale library on OS X.

I'm seeing some unexpected failures, as well as some instances where conversion should fail, but doesn't.

Here's the program.

#warning call this program like: "locale -a | ./a.out" or pass \
locale names valid for your platform, one per line via standard input

#include <iostream>
#include <codecvt>
#include <locale>
#include <array>
#include <string>   // std::string, std::getline
#include <utility>  // std::forward

// Wrapper exposing a public destructor: facets such as codecvt_byname have a
// protected destructor, so wstring_convert can't otherwise own one directly.
template <class Facet>
class usable_facet : public Facet {
public:
    // FIXME: use inheriting constructors when available
    // using Facet::Facet;
    template <class ...Args>
    usable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
    ~usable_facet() {}
};

int main() {
    std::array<std::wstring,11> args = {L"a",L"é",L"¤",L"€",L"Да",L"Ψ",L"א",L"আ",L"✈",L"가",L""}; // test strings ranging from ASCII through Latin-1 to several non-Latin scripts

    std::wstring_convert<usable_facet<std::codecvt_utf8<wchar_t>>> u8cvt; // wchar_t uses UCS-4/UTF-32 on this platform

    int convert_failures = 0;
    std::string line;
    while(std::getline(std::cin,line)) {
        if(line.empty())
            continue;

        using codecvt = usable_facet<std::codecvt_byname<wchar_t,char,std::mbstate_t>>;
        std::wstring_convert<codecvt> convert(new codecvt(line));

        for(auto const &s : args) {
            try {
                convert.to_bytes(s); // result discarded; we only care whether it throws
            } catch (std::range_error &e) {
                convert_failures++;
                std::cout << line << " : " << u8cvt.to_bytes(s) << '\n';
            }
        }
    }

    std::cout << std::string(80,'=') << '\n';
    std::cout << convert_failures << " wstring_convert to_bytes failures.\n";
}

Here are some examples of correct output:

en_US.ISO8859-1 : €
en_US.US-ASCII : ✈

Here's an example of unexpected output:

en_US.ISO8859-15 : €

The euro sign does exist in the ISO 8859-15 charset (at code point 0xA4), so this conversion should not be failing.

Here are examples of output that I expect but do not receive:

en_US.ISO8859-15 : ¤
en_US.US-ASCII : ¤

This is the currency sign, which exists in ISO 8859-1 but was removed and replaced with the euro sign in ISO 8859-15. These conversions should not be succeeding, yet no error is signaled. Examining the case further, I find that in both locales '¤' is converted to 0xA4, which is the ISO 8859-1 representation of '¤'.
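Here's roughly how I examine the bytes produced in these cases (a minimal sketch, assuming the two locale names from my platform are valid on yours):

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>
#include <utility>

// same wrapper as in the program above: exposes a public destructor
template <class Facet>
struct usable_facet : Facet {
    template <class ...Args>
    usable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
    ~usable_facet() {}
};

int main() {
    using codecvt = usable_facet<std::codecvt_byname<wchar_t,char,std::mbstate_t>>;
    for(const char *name : {"en_US.ISO8859-15", "en_US.US-ASCII"}) {
        std::wstring_convert<codecvt> convert(new codecvt(name));
        try {
            std::string bytes = convert.to_bytes(L"\u00A4"); // U+00A4 CURRENCY SIGN
            std::cout << name << " :" << std::hex << std::uppercase;
            for(unsigned char c : bytes)
                std::cout << " 0x" << (unsigned)c;
            std::cout << '\n'; // on my system this prints 0xA4 for both locales
        } catch(std::range_error&) {
            std::cout << name << " : conversion failed (what I would expect)\n";
        }
    }
}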

I'm not using xlocale directly, only indirectly through libc++. Is xlocale on Mac OS X simply broken, with bad locale definitions? Is there a way to fix it? Or are the issues I'm seeing the result of something else?

bames53

2 Answers


I suspect you are seeing problems with the xlocale system. A bug report would be most appreciated!

Howard Hinnant
  • Still looks broken in 10.8 :( Maybe there's some way to get at the xlocale data and hack a fix in manually? – bames53 Jul 27 '12 at 17:07
  • It turns out that UTF-32 is not in fact used as the wchar_t encoding by all locales on OS X, which is quite unfortunate. – bames53 Sep 23 '15 at 23:06
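
One way to check which wide encoding a given locale actually uses: decode a known byte sequence with the C library's `mbrtowc` and compare the result against the Unicode code point. A minimal sketch (the locale name is an assumption; substitute one installed on your system):

#include <clocale>
#include <cwchar>
#include <iostream>

int main() {
    if(!std::setlocale(LC_ALL, "en_US.UTF-8")) { // assumed locale name
        std::cerr << "locale not available\n";
        return 1;
    }
    const char bytes[] = "\xE2\x82\xAC"; // UTF-8 encoding of U+20AC EURO SIGN
    wchar_t wc = 0;
    std::mbstate_t state{};
    std::size_t r = std::mbrtowc(&wc, bytes, sizeof bytes - 1, &state);
    if(r == (std::size_t)-1 || r == (std::size_t)-2) {
        std::cerr << "decode failed\n";
        return 1;
    }
    // if this locale's wchar_t encoding is UTF-32/UCS-4, this prints 20ac
    std::cout << std::hex << (unsigned long)wc << '\n';
}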

I don't know why you're expecting wchar_t to be UTF-32, or where you got the idea that it's "OS X's convention that wchar_t is UTF-32." That is certainly incorrect. wchar_t is only 16 bits wide.

See http://en.wikipedia.org/wiki/Wide_character for more information about wchar_t.
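
For what it's worth, the width is easy to check on any particular platform:

#include <climits>
#include <iostream>

int main() {
    // prints the width of wchar_t in bits on the current platform
    std::cout << sizeof(wchar_t) * CHAR_BIT << '\n';
}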

Rob Lu
  • `wchar_t` is 32 bits wide on OS X and most unix operating systems, not 16. – bames53 Feb 20 '13 at 17:41
  • … a fact which Wikipedia mentions, alongside the tidbit that it could also be 8 bits on other platforms. C++11 adds `char16_t` and `char32_t` to resolve this, but that's unrelated. – Potatoswatter Feb 27 '13 at 15:09