13

How do I write a std::codecvt facet? I'd like to write facets that convert from UTF-16 to UTF-8, from UTF-16 to the system's current code page (Windows, so CP_ACP), and from UTF-16 to the system's OEM code page (Windows, so CP_OEMCP).

Cross-platform is preferred, but MSVC on Windows is fine too. Are there any tutorials or similar resources on how to use this class correctly?

Billy ONeal
  • You might take a look at [the example in the libstdc++ manual](http://gcc.gnu.org/onlinedocs/libstdc++/manual/codecvt.html). – James McNellis Jun 06 '10 at 20:47
  • For locales and facets the only book I know that goes into any detail is http://www.angelikalanger.com/iostreams.html but it's only got a few pages on codecvt specifically. –  Jun 06 '10 at 21:16
  • 3
    I can't believe that nobody seems to know squat about this class in the Standard library -- particularly given how potentially useful it can be... – Billy ONeal Jun 06 '10 at 22:38
  • 1
    @James: that example is awkward. The conversion direction can't be stored in a locale that way. – Basilevs Jul 20 '10 at 05:15

2 Answers

12

I've written one based on iconv. It can be used on Windows or on any POSIX OS. (You will need to link against iconv, obviously.)

Enjoy

The answer to the "how to" question is to follow the codecvt reference. I was not able to find any better instructions on the Internet two years ago.

Important notices

  • Theoretically, there is no need for such work: codecvt_byname should be enough on any standard-conforming platform. In reality, some compilers don't support this class or support it badly, and the interface of codecvt_byname also differs between compilers.
  • My working example is implemented with the state template parameter of codecvt. Always use the standard std::mbstate_t type there, as this is the only way to use your codecvt with the standard iostream classes.
  • The std::mbstate_t type can't be used to hold a pointer on 64-bit platforms in a cross-platform way.
  • Stateless conversions work for short strings, but may fail if you try to convert a chunk of data larger than the streambuf's internal buffer (a multi-unit UTF sequence can be split across a buffer boundary, so the conversion is essentially stateful).
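
To make the notes above concrete, here is a minimal sketch of what such a facet can look like. It is not the iconv-based implementation mentioned above; the names `utf16_to_utf8_facet` and `to_utf8` are made up for illustration, only the output direction (`do_out`) is overridden, and each `wchar_t` is assumed to hold one UTF-16 code unit. It keeps the standard `std::mbstate_t` state type and sidesteps the buffer-boundary problem by refusing to consume a high surrogate until its partner is available, rather than by storing anything in the state.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical facet: UTF-16 code units (held in wchar_t) -> UTF-8 bytes.
// Only do_out is customized; do_in and friends keep the base-class behavior.
class utf16_to_utf8_facet : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit utf16_to_utf8_facet(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        static const unsigned char lead[] = {0x00, 0x00, 0xC0, 0xE0, 0xF0};
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            unsigned long cp = static_cast<unsigned long>(*from_next);
            int units = 1;  // number of input code units this character uses
            if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
                // Do not consume half a pair; ask the caller for more input.
                if (from_end - from_next < 2) return partial;
                unsigned long lo = static_cast<unsigned long>(from_next[1]);
                if (lo < 0xDC00 || lo > 0xDFFF) return error;
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                units = 2;
            } else if ((cp >= 0xDC00 && cp <= 0xDFFF) || cp > 0x10FFFF) {
                return error;  // unpaired low surrogate or out-of-range value
            }
            int bytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (to_end - to_next < bytes) return partial;  // output buffer full
            // Fill continuation bytes from the back, then the lead byte.
            for (int i = bytes - 1; i > 0; --i) {
                to_next[i] = static_cast<char>(0x80 | (cp & 0x3F));
                cp >>= 6;
            }
            to_next[0] = static_cast<char>(lead[bytes] | cp);
            to_next += bytes;
            from_next += units;
        }
        return ok;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 0; }    // variable width
    int do_max_length() const noexcept override { return 4; }  // bytes per char
};

// Hypothetical helper that drives the facet directly, without a stream.
std::string to_utf8(const std::wstring& in) {
    const utf16_to_utf8_facet facet(1);  // refs = 1: not owned by any locale
    std::mbstate_t state = std::mbstate_t();
    std::string out(in.size() * facet.max_length(), '\0');
    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    facet.out(state, in.data(), in.data() + in.size(), from_next,
              &out[0], &out[0] + out.size(), to_next);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

To use a facet like this with iostreams, the usual approach is to imbue the stream before opening the file, for example `file.imbue(std::locale(file.getloc(), new utf16_to_utf8_facet));` on a `std::wofstream`. How a particular `basic_filebuf` reacts to a `partial` result is implementation-specific, so this should be tested on the target compiler.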
Basilevs
  • 1
    +1 -- I was not aware that `codecvt_byname` existed, and it turns out my compiler actually supports such a thing correctly. (Who knew?) Not accepting this yet because it isn't a direct answer to the question but if/when the bounty expires you'll get the points anyway. – Billy ONeal Jun 07 '10 at 09:35
4

The problem with std::codecvt is that it's a solution looking for a problem. Or rather, the problem it's trying to solve is unsolvable, so anybody trying to use it as a solution is going to be very disappointed.

If you don't know which character set your input or output is in, then std::codecvt isn't ever going to be able to help you. Conversely, if you do know which character sets you're using, then you can trivially convert between them with a single function call. Wrapping that function call in a complicated mess of templates doesn't change those fundamentals.

...and that's why nobody uses std::codecvt. I recommend you just do what everybody else does, and pretend it never happened.

apenwarr
  • 3
    I know exactly which code page I'm using. I want to be able to tell iostreams which code page to use, and the only way to do that is with `std::codecvt`. Sure, I can convert a block of text between code pages without a problem, but there's no way to say, "format this integer to be 8 spaces wide, fill the blanks with zeros" without a big mess of `std::wstringstream`s. I'd rather just be able to make iostreams natively convert to the correct code page, given that it already has a facility for doing so. -1 for not answering the question. – Billy ONeal Jun 07 '10 at 05:57
  • 4
    As for "Nobody uses `std::codecvt`", can you explain why conversion facets for Unicode are being added in C++0x, and http://www.boost.org/doc/libs/1_43_0/libs/serialization/doc/codecvt.html ? – Billy ONeal Jun 07 '10 at 06:01
  • 2
    Probably in the hopes that people will *start* using std::codecvt once it's no longer useless. – apenwarr Jun 07 '10 at 21:46