Stumped with Unicode, Boost, C++, codecvts

Question

In C++, I want to use Unicode to do things. So after falling down the rabbit hole of Unicode, I've managed to end up in a train wreck of confusion, headaches and locales.

But in Boost I've had the unfortunate problem of trying to use Unicode file paths and trying to use the Boost program options library with Unicode input. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.

My current attempt to get things to work is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows). I've been trying to avoid wchar_t.

The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.

#include <iostream>
#include <locale>
#include <string>
#include <boost/locale.hpp>

int main(void)
{
  std::string data("Testing, 㤹");

  std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
  std::locale toLoc   = boost::locale::generator().generate("en_US.UTF-32");

  typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
  cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);

  std::locale convLoc = std::locale(fromLoc, toCvt);

  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

  // Output is unconverted -- what?

  return 0;
}

I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?

score 12 · Accepted Answer · answered Dec 09 '11 at 06:56

Okay, after a long few months I've figured it out, and I'd like to help people in the future.

First of all, the codecvt thing was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there are others that aren't based on locales).

#include <boost/locale.hpp>
#include <iostream>
#include <string>
namespace loc = boost::locale;

int main(void)
{
  loc::generator gen;
  std::locale blah = gen.generate("en_US.utf-32");

  std::string UTF8String = "Tésting!";
  // from_utf will also work with wide strings as it uses the character size
  // to detect the encoding.
  std::string converted = loc::conv::from_utf(UTF8String, blah);

  // Outputs a UTF-32 string.
  std::cout << converted << std::endl;

  return 0;
}

If you replace "en_US.utf-32" with "", the string is converted to the user's locale encoding instead.
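To make concrete what a UTF-8 to UTF-32 conversion produces, here is a minimal sketch that uses only the standard library. It is not how Boost.Locale implements the conversion; `decodeUtf8` is a hypothetical helper written for illustration, and it assumes the input only needs basic structural checks (it does not reject overlong forms or surrogate code points).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Boost.Locale): decodes a UTF-8 byte
// string into UTF-32 code points, one char32_t per code point.
std::u32string decodeUtf8(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this sketch, the bytes `T`, `0xC3 0xA9`, `s`, `t`, `i`, `n`, `g`, `!` (that is, "Tésting!" in UTF-8) decode to eight code points, the second of which is U+00E9. For real code, a library routine such as the Boost.Locale conversion above is the safer choice.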

I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.

As for using UTF-8 strings with the filesystem cross-platform, it seems that's possible; here's a link to how to do it.

Here is a link that doesn't go to the index page (for that last link) http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/default_encoding_under_windows.html — elegant dice, Sep 14 '12 at 06:32
You say using codecvt is bad, but then why does Boost use codecvt as the conversion method in its filesystem library, specifically in the path class? — hakunami, Oct 17 '12 at 01:01
Hmm.... I answered this in '11, so I'm not exactly sure what my mindset was. I suppose the codecvt way that /I/ posted was the wrong way of doing it. Boost.Locale itself uses codecvts to interface with Boost.Filesystem. — Jookia, Oct 18 '12 at 15:20

Yakov Galka · Answer 2 · 2011-10-23T11:01:18.903

3

  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

This does no conversion, ~~since it uses codecvt<char, char, mbstate_t> which is a no-op~~. The only standard streams that use codecvt are file-streams. std::cout is not required to perform any conversion at all.
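The difference between file streams and other streams can be seen with a small standard-library-only sketch. `UpperCvt`, `writeViaFileStream` and `writeViaStringStream` are hypothetical names made up for this illustration: the facet upper-cases ASCII letters on output purely so that any conversion is easy to spot. A file stream consults the facet; a string stream never does, and std::cout is not required to.

```cpp
#include <cctype>
#include <cstdio>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: a char-to-char codecvt that
// upper-cases ASCII letters whenever a stream buffer asks it to convert.
class UpperCvt : public std::codecvt<char, char, std::mbstate_t> {
protected:
    // Report that conversion is needed, so file buffers call do_out().
    bool do_always_noconv() const noexcept override { return false; }

    result do_out(std::mbstate_t&, const char* from, const char* from_end,
                  const char*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        for (; from != from_end && to != to_end; ++from, ++to)
            *to = static_cast<char>(
                std::toupper(static_cast<unsigned char>(*from)));
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }
};

// Writes text through a file stream imbued with UpperCvt and returns the
// bytes that actually ended up in the file.
std::string writeViaFileStream(const std::string& text, const char* path) {
    const std::locale loc(std::locale::classic(), new UpperCvt);
    {
        std::ofstream out;
        out.imbue(loc);  // imbue before opening the file
        out.open(path, std::ios::binary);
        out << text;
    }  // closing the stream flushes the buffer through the facet
    std::ifstream in(path, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    in.close();
    std::remove(path);
    return bytes;
}

// Writes text through a string stream imbued with the same facet.
std::string writeViaStringStream(const std::string& text) {
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new UpperCvt));
    out << text;
    return out.str();  // the string buffer ignores the codecvt facet
}
```

Calling `writeViaFileStream("abc", "codecvt_demo.txt")` gives back "ABC", while `writeViaStringStream("abc")` gives back "abc" unchanged. That is the same reason imbuing std::cout in the question had no visible effect.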

To force Boost.Filesystem to interpret narrow-strings as UTF-8 on windows, use boost::filesystem::imbue with a locale with a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale has an implementation of the latter.

edited Oct 23 '11 at 11:01

answered Oct 22 '11 at 13:06

Yakov Galka

@Jookia: It's unclear to me what exactly you want. You're trying to output a string with unknown encoding (writing a string literal containing Unicode characters is already non-portable) to cout, which doesn't have a standardized encoding, so you're free to assume whatever encoding you like. I always assume that [`cout` is UTF-8](http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful) and let the user configure his console to use UTF-8 or open the files with editors that understand UTF-8. – Yakov Galka Oct 23 '11 at 07:40
My code isn't the entire problem, I'm having trouble dealing with Boost, C++, locales and Unicode in general. I want to use UTF-8 strings in my program, and translate the user's locale from/to UTF-8, for use with cout and cin, which I can't figure out how to do. But then I want to use UTF-8 and Boost, which seems to be impossible as it uses wide strings, which don't help at all. – Jookia Oct 23 '11 at 09:28
@Jookia: Again, your question is too vague: "I'm having trouble ... in general"! "I want to use UTF-8 strings in my program" go on! That's what I do. "user's locale from/to UTF-8 for use with cout and cin" why? Just assume it's UTF-8 and let those who use legacy encoding change their encoding to UTF-8. On windows you are meant to use wcin and wcout to read/write unicode data, but it's going to be non-portable as you'll have to maintain two versions of your code, one that uses wcout on windows and one that uses cout on non-windows. You don't want this, do you? – Yakov Galka Oct 23 '11 at 10:09
"But then I want to use UTF-8 and Boost, which seems to be impossible" in some parts of boost it's possible but inconvenient. Some parts of boost don't support unicode on windows at all (Boost.Interprocess), some do it wrongly (Boost.Program_Options) and some are painful for cross-platform code (Boost.Filesystem). "as it uses wide strings" no. Some parts of boost use narrow-char (Boost.Interprocess), some use both (Boost.Filesystem). The problem is that those who use the narrow-string assume the native encoding instead of UTF-8 by default, imposing the burden on you. – Yakov Galka Oct 23 '11 at 10:09
[There is a fight in the Boost community over deprecating wide-char and assuming all narrow strings are UTF-8](http://lists.boost.org/Archives/boost/2011/01/174850.php). We (proponents of UTF-8) are currently losing since there isn't much demand, and most Boost developers (e.g. the author of Filesystem) live in the Unix world and don't face the real-world trouble of writing Unicode-correct production code that is portable between Windows and Linux. If you want to change the status quo, open the discussion on the Boost mailing list again. – Yakov Galka Oct 23 '11 at 10:13
So I should drop the idea of Unicode and stick with good ol' ASCII, seeing as you can't really use Unicode portably? Or maybe I should just stick with UNIX? – Jookia Oct 24 '11 at 04:09
@Jookia: Why do you give up? Yes, you can't always do it portably with existing libraries, and you need to write boiler-plate code to do it with others. The choice of supporting it within the boundaries of the possible is up to you. – Yakov Galka Oct 25 '11 at 09:29
I give up because I don't understand what the problem is fully, and from that I can't actually deduce a possible solution. – Jookia Oct 25 '11 at 17:09

score 3 · Answer 3 · edited Mar 17 '13 at 08:41

The Boost filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.

However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ in Windows - at least as of Boost version 1.47. There is a code comment explaining that; essentially, the Visual C++ standard library provides non-standard wchar_t based constructors that Boost filesystem classes make use of, but g++ does not support these extensions.

A workaround is to use 8.3 short filenames, but this solution is a bit brittle since with old Windows versions the user can turn off automatic generation of short filenames.

Example code for using Boost filesystem in Windows:

#include "CmdLineArgs.h"        // CmdLineArgs
#include "throwx.h"             // throwX, hopefully
#include "string_conversions.h" // ansiWithFillersFrom( wstring )

#include <boost/filesystem/fstream.hpp>     // boost::filesystem::ifstream
#include <iostream>             // std::cout, std::cerr, std::endl
#include <stdexcept>            // std::runtime_error, std::exception
#include <string>               // std::string
#include <stdlib.h>             // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;

inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }

int main()
{
    try
    {
        CmdLineArgs const   args;
        wstring const       programPath     = args.at( 0 );

        hopefully( args.nArgs() == 2 )
            || throwX( "Usage: " + ansi( programPath ) + " FILENAME" );

        wstring const       filePath        = args.at( 1 );
        bfs::ifstream       stream( filePath );     // Nice Boost ifstream subclass.
        hopefully( !stream.fail() )
            || throwX( "Failed to open file '" + ansi( filePath ) + "'" );

        string line;
        while( getline( stream, line ) )
        {
            cout << line << endl;
        }
        hopefully( stream.eof() )
            || throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );

        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}

edited Mar 17 '13 at 08:41

Smi

answered Oct 22 '11 at 13:22

Cheers and hth. - Alf

I'm trying to do it cross-platform. – Jookia Oct 22 '11 at 13:29
@Jookia: ok. i was assuming you restricted yourself to UTF-8 locale *nix (and Mac), and Windows. supporting general cross-platform is I think not something one man can do. good luck! – Cheers and hth. - Alf Oct 22 '11 at 13:34
@Jookia: This answer is a proof of some of my claims below. To use boost.filesystem with unicode on windows you must use wstring, on non-windows you definitely want to use string. This is how boost.filesystem does *not* hide the platform differences and does *not* make writing cross-platform code simpler. I must admit that in case of boost.fs you can change the way it interprets narrow-strings to UTF-8, thus making it easier to port the code. However, the point is that boost could make our life much easier by just changing two lines in boost.fs. And it's a pity they don't want to. – Yakov Galka Oct 23 '11 at 10:31
@ybungalobill: note that boost filesystem does not support general filenames with g++ in windows, and that that problem can't be fixed by using utf-8 encoding everywhere. – Cheers and hth. - Alf Oct 23 '11 at 11:08
@AlfP.Steinbach Excuse me, what do you mean *exactly*? If it's compiled against BOOST_POSIX_API then indeed it does not. If it's compiled against BOOST_WINDOWS_API then the only part that doesn't is the boost::filesystem::i/ofstream. They could implement the latter by implementing the filebuf using the Windows API directly (I did this). – Yakov Galka Oct 23 '11 at 11:17
@ybungalobill: i mean exactly what i wrote, which was pretty exact. e.g., currently boost::filesystem::ifstream does not manage to open a file with a name like [π.recipe], when it's used with g++. well unless one sets the ANSI codepage to some encoding that supports π, then forsaking some other characters. using utf-8 doesn't fix that. the upshot is that neither the standard library nor boost supports general filenames in general in Windows, except for with Visual C++. – Cheers and hth. - Alf Oct 23 '11 at 12:36
@AlfP.Steinbach "boost filesystem does not support general filenames" can mean, e.g. "boost::filesystem::remove() doesn't support general filenames" among all other possible interpretations. Now it's clear. Anyway, the point of UTF-8 is not to magically support unicode but rather to provide a uniform portable interface among systems that have such support. – Yakov Galka Oct 23 '11 at 13:58
@ybungalobill: hm. i think a "uniform portable interface" is good idea, but utf8 as common encoding clashes with the principle of not paying for what you don't need or use. the native encoding used should instead be abstracted away by the uniform portable interface. anyway you first need something that *works* on all relevant platforms, and standard filestreams don't with g++ in Windows. however, a set of wrappers like **[this](http://codepad.org/oVH0lkbm)** go a long way toward supporting std streams in Windows. it's only when short names are turned off in Windows, that it fails. hth., – Cheers and hth. - Alf Oct 24 '11 at 08:41
@AlfP.Steinbach: What are you paying if you assume UTF-8? If you're talking about the conversion to/from UTF-8, then I say that any portable solution will involve some kind of conversion. This already starts when C gives preference to narrow-char versus wide-char (`""` is shorter than `L""`). It continues when you have existing standardized interfaces that you can't change the API/ABI but can change the semantics (e.g. the only way to make std::exception::what unicode-aware on all platforms is to standardize it to be UTF-8, or whatever UTF that fits within the char on that platform). – Yakov Galka Oct 25 '11 at 09:55
So if you really want to analyze the cost of assuming UTF-8, you must analyze the usage patterns and figure out where you need to do the conversions. Moreover, there's no such thing as native encoding on windows. You have *two* encodings. UTF-16 and some deprecated non-unicode aware 'ANSI' encoding. Furthermore, you always talk about boost::fs::fstreams, which is a rather minor part of this library. The rest of the functionality will work unchanged. – Yakov Galka Oct 25 '11 at 09:57
"...you first need something that works on all relevant platforms..." disagree. If it doesn't work on one platform where this can't work because it's *impossible*, is it an excuse for scrapping it even if it makes things easier on all other platforms? – Yakov Galka Oct 25 '11 at 09:57
