Inserting narrow character string to std::basic_ostream

Question

According to cppreference, there is an operator<< overload for std::basic_ostream<wchar_t> that accepts const char*. The conversion appears to simply widen each char into a wchar_t, so the number of wide characters inserted equals the number of narrow characters. This raises a problem: the narrow character string may encode international characters, say Chinese characters in GB2312. Assume further that sizeof(wchar_t) is 2 and that wide strings use UTF-16. How is this naive character-by-character conversion supposed to work?

I would say that it *won't* work. If you need to convert between different encoding and characters widths, you should look at a library which handles it, like [ICU](http://site.icu-project.org/). — Some programmer dude, Oct 02 '15 at 13:16
@JoachimPileborg Then how does wide character logging work in Boost.Log? Please see http://www.boost.org/doc/libs/1_59_0/libs/log/doc/html/log/tutorial/wide_char.html — Lingxi, Oct 02 '15 at 13:19
I can't say anything for Boost log, but it might simply do proper conversion somewhere? — Some programmer dude, Oct 02 '15 at 13:20
@JoachimPileborg I don't think so. It just imbued a customized locale. Look for the `operator <<` overload for `severity_level` in the linked page. — Lingxi, Oct 02 '15 at 13:24

score 0 · Accepted Answer · edited May 23 '17 at 12:03

I have just checked in Visual Studio 2015 and you are right. The chars are only widened to wchar_ts without any conversion. It seems to me that you will have to convert the narrow character string into a wide character string yourself. There are several ways to do it, some of which have already been suggested.
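The widening described above can be observed with a small check. This is only a sketch, and the helper names insert_narrow and check_widening are made up for illustration. It sticks to ASCII input, where the result is predictable. For bytes outside ASCII, what widen() produces depends on the stream's locale and on the implementation, but each byte is still handled on its own and never as part of a multibyte sequence.

```cpp
#include <cassert>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical helper: inserts a narrow string into a wide stream and
// returns what the stream ended up storing.
std::wstring insert_narrow(const char* s)
{
    std::wostringstream out;
    out << s;  // picks operator<<(basic_ostream<wchar_t>&, const char*)
    return out.str();
}

// Hypothetical check: for ASCII input, one wchar_t is stored per char and
// each one is simply the widened form of the corresponding char.
void check_widening()
{
    const char narrow[] = "abc";
    const std::wstring wide = insert_narrow(narrow);
    assert(wide.size() == std::strlen(narrow));
    assert(wide == L"abc");
}
```

A two-byte GB2312 sequence fed through the same overload is treated as two unrelated chars, which is why a real conversion step is needed before inserting.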

Here I propose using pure C++ facilities to do it, assuming your C++ compiler and standard library are complete enough (Visual Studio, or GCC on Linux (and only there)):

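#include <cstddef>   // std::size_t
#include <cstring>   // std::memset
#include <cwchar>    // std::mbstate_t
#include <locale>    // std::locale, std::codecvt, std::use_facet
#include <string>    // std::wstring
#include <vector>    // std::vector

void clear_mbstate (std::mbstate_t & mbs);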

void
towstring_internal (std::wstring & outstr, const char * src, std::size_t size,
    std::locale const & loc)
{
    if (size == 0)
    {
        outstr.clear ();
        return;
    }

    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;
    const CodeCvt & cdcvt = std::use_facet<CodeCvt>(loc);
    std::mbstate_t state;
    clear_mbstate (state);

    char const * from_first = src;
    std::size_t const from_size = size;
    char const * const from_last = from_first + from_size;
    char const * from_next = from_first;

    std::vector<wchar_t> dest (from_size);

    wchar_t * to_first = &dest.front ();
    std::size_t to_size = dest.size ();
    wchar_t * to_last = to_first + to_size;
    wchar_t * to_next = to_first;

    CodeCvt::result result;
    std::size_t converted = 0;
    while (true)
    {
        result = cdcvt.in (
            state, from_first, from_last,
            from_next, to_first, to_last,
            to_next);
        // XXX: Even if only half of the input has been converted the
        // in() method returns CodeCvt::ok. I think it should return
        // CodeCvt::partial.
        if ((result == CodeCvt::partial || result == CodeCvt::ok)
            && from_next != from_last)
        {
            to_size = dest.size () * 2;
            dest.resize (to_size);
            converted = to_next - to_first;
            to_first = &dest.front ();
            to_last = to_first + to_size;
            to_next = to_first + converted;
            continue;
        }
        else if (result == CodeCvt::ok && from_next == from_last)
            break;
        else if (result == CodeCvt::error
            && to_next != to_last && from_next != from_last)
        {
            clear_mbstate (state);
            ++from_next;
            from_first = from_next;
            *to_next = L'?';
            ++to_next;
            to_first = to_next;
        }
        else
            break;
    }
    converted = to_next - &dest[0];

    outstr.assign (dest.begin (), dest.begin () + converted);
}

void
clear_mbstate (std::mbstate_t & mbs)
{
    // Initialize/clear mbstate_t type.
    // XXX: This is just a hack that works. The shape of mbstate_t varies
    // from single unsigned to char[128]. Without some sort of initialization
    // the codecvt::in/out methods randomly fail because the initial state is
    // random/invalid.
    std::memset (&mbs, 0, sizeof (std::mbstate_t));
}

This function is part of the log4cplus library and it works. It uses the codecvt facet to do the conversion. You have to give it an appropriately set up locale.
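To illustrate how the locale is passed in, here is a much reduced sketch of the same codecvt technique. It is not the log4cplus code: the names widen_with_locale and check_classic_locale are hypothetical, it calls in() only once, and it assumes the conversion never yields more wide characters than there are input bytes, so it does not grow the buffer or substitute '?' on errors the way the function above does.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: one-shot narrow-to-wide conversion through the
// codecvt facet of the given locale.
std::wstring widen_with_locale(const std::string& src, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;
    if (src.empty())
        return std::wstring();

    const CodeCvt& cvt = std::use_facet<CodeCvt>(loc);

    // A value-initialized mbstate_t is zeroed, which is the initial
    // conversion state.
    std::mbstate_t state = std::mbstate_t();

    // Assumption: at most one wide character per input byte.
    std::vector<wchar_t> dest(src.size());
    const char* from_first = src.data();
    const char* from_next = from_first;
    wchar_t* to_first = &dest[0];
    wchar_t* to_next = to_first;

    cvt.in(state, from_first, from_first + src.size(), from_next,
           to_first, to_first + dest.size(), to_next);

    // Keep whatever was converted before in() stopped, even on error.
    return std::wstring(to_first, to_next);
}

// Hypothetical check: plain ASCII converts unchanged in the classic locale.
void check_classic_locale()
{
    assert(widen_with_locale("hello", std::locale::classic()) == L"hello");
}
```

For GB2312 input the locale has to be a named one, for example std::locale("zh_CN.GB2312") on a Linux system where that locale is installed. The name is platform-specific and is only an assumption here; the std::locale constructor throws std::runtime_error if the name is not known to the system.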

Visual Studio might have issues giving you an appropriately set up locale for GB2312. You might have to use _setmbcp() for it to work. See "double byte character sequence conversion issue in Visual Studio 2015" for details.
