How to correctly use codecvt_byname (C++17) to encode latin1, and then UTF-8 for use in JSON

Question

I am (desperately) trying to prepare a byte array for use in nlohmann::json while preserving the source encoding (latin1). The bytes are copied from a PLC, where the "string" is constructed as a byte array; the text is German, French, etc.

With the toy example shown after the build log, the compiler complains that ~codecvt() and ~codecvt_byname() are protected:

/usr/bin/g++   -O3 -DNDEBUG -std=c++17 -MD -MT CMakeFiles/encod.dir/src/encod.cpp.o -MF CMakeFiles/encod.dir/src/encod.cpp.o.d -o CMakeFiles/encod.dir/src/encod.cpp.o -c /src/encod.cpp
In file included from /usr/include/c++/12/locale:43,
                 from /src/encod.cpp:1:
/usr/include/c++/12/bits/locale_conv.h: In instantiation of ‘std::__detail::_Scoped_ptr<_Tp>::~_Scoped_ptr() [with _Tp = std::codecvt<wchar_t, char, __mbstate_t>]’:
/usr/include/c++/12/bits/locale_conv.h:309:7:   required from here
/usr/include/c++/12/bits/locale_conv.h:241:26: error: ‘virtual std::codecvt<wchar_t, char, __mbstate_t>::~codecvt()’ is protected within this context
  241 |         ~_Scoped_ptr() { delete _M_ptr; }
      |                          ^~~~~~~~~~~~~
In file included from /usr/include/c++/12/bits/locale_facets_nonio.h:2067,
                 from /usr/include/c++/12/locale:41:
/usr/include/c++/12/bits/codecvt.h:429:7: note: declared protected here
  429 |       ~codecvt();
      |       ^
In file included from /usr/include/c++/12/memory:76,
                 from /src/encod.cpp:6:
/usr/include/c++/12/bits/unique_ptr.h: In instantiation of ‘void std::default_delete<_Tp>::operator()(_Tp*) const [with _Tp = std::codecvt_byname<wchar_t, char, __mbstate_t>]’:
/usr/include/c++/12/bits/unique_ptr.h:396:17:   required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = std::codecvt_byname<wchar_t, char, __mbstate_t>; _Dp = std::default_delete<std::codecvt_byname<wchar_t, char, __mbstate_t> >]’
/src/encod.cpp:18:152:   required from here
/usr/include/c++/12/bits/unique_ptr.h:95:9: error: ‘std::codecvt_byname<_InternT, _ExternT, _StateT>::~codecvt_byname() [with _InternT = wchar_t; _ExternT = char; _StateT = __mbstate_t]’ is protected within this context
   95 |         delete __ptr;
      |         ^~~~~~~~~~~~
/usr/include/c++/12/bits/codecvt.h:722:7: note: declared protected here
  722 |       ~codecvt_byname() { }
      |       ^

#include <locale>
#include <codecvt>
#include <vector>
#include <string>
#include <iostream>
#include <memory>

int main() {
    std::vector<uint8_t> v = {0x68, 0xe4, 0x6c, 0x6c, 0x6f}; // hällo

    std::string my_string(v.begin(), v.end());

    // Convert to wide string
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_conv;
    std::wstring wide_str = utf8_conv.from_bytes(my_string);

    // Convert wide string to Latin1 string
    std::unique_ptr<std::codecvt_byname<wchar_t, char, std::mbstate_t>> 
            latin1_cvt(new std::codecvt_byname<wchar_t, char, std::mbstate_t>("iso-8859-1"));
    std::wstring_convert<std::codecvt<wchar_t, char, std::mbstate_t>> latin1_conv(latin1_cvt.get());
    std::string latin1_str = latin1_conv.to_bytes(wide_str);


    std::cout << latin1_str << std::endl;

    return 0;
}

How can I make this work? Would I be better off using ICU for this scenario, or am I simply using it wrong?

FWIW, `std::codecvt` has been deprecated in C++20, which clears the way for it to be removed in a future standard. If you want to future-proof the code, I would suggest using ICU to handle the different encodings. — NathanOliver, Mar 31 '23 at 14:02
@RemyLebeau It depends on which one you are talking about. [`std::codecvt_byname`](https://en.cppreference.com/w/cpp/locale/codecvt_byname) was deprecated in C++20, [`std::codecvt_utf8`](https://en.cppreference.com/w/cpp/locale/codecvt_utf8) (and the other utf versions) were deprecated in C++17 — NathanOliver, Mar 31 '23 at 16:50
thanks a lot @RemyLebeau - I was not aware at all. This is very good to know. I will not use it then — juwalter, Mar 31 '23 at 18:08
thank you @NathanOliver - then I will go with ICU instead, since we are looking forward to C++20 — juwalter, Mar 31 '23 at 18:10
Locale and encoding are separate things. What a "French encoding" is supposed to be is anyone's guess. Please be precise in describing what you have and what you want to achieve. JSON encoded in latin1 is highly unusual and not recommended; prefer UTF-8. — n. m. could be an AI, Apr 01 '23 at 09:46
@NathanOliver ICU is a 30 MiB library with a considerable learning curve. A Latin1 <-> UTF-8 transcoder can be done in half an hour and a couple of KiB. — n. m. could be an AI, Apr 01 '23 at 09:57
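
The hand-written transcoder suggested in the last comment could look roughly like the sketch below. It assumes the PLC bytes really are ISO-8859-1, where every byte value equals its Unicode code point; `latin1_to_utf8` is a hypothetical helper name, not part of any library.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: re-encode ISO-8859-1 bytes as UTF-8.
// Assumes each input byte is the code point U+0000..U+00FF it represents.
std::string latin1_to_utf8(const std::vector<std::uint8_t>& bytes) {
    std::string out;
    out.reserve(bytes.size() * 2);  // worst case: every byte becomes two
    for (std::uint8_t b : bytes) {
        if (b < 0x80) {
            // ASCII range is identical in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF encode as two bytes: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

For the bytes in the question, 0xe4 (ä) becomes the pair 0xC3 0xA4, and the resulting std::string is valid UTF-8, which is what nlohmann::json expects for its strings.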

Remy Lebeau · Answer 1 · 2023-04-01T17:03:08.213

Note that most of the std::codecvt_... types are deprecated, so you should not be using them anymore. However, they do still work for existing implementations.

That said, you are simply using std::codecvt_byname wrong, which is why you are getting the compiler error.

Unlike the std::codecvt_utf... classes, which are meant to be usable by themselves and thus have public destructors, std::codecvt_byname is a locale-managed facet and so it has a protected destructor, which means you cannot destroy a std::codecvt_byname object directly. Locale-managed facets are owned by std::locale, and it will destroy any facet that is assigned to it. This is mentioned in the ~codecvt documentation on cppreference.com:

https://en.cppreference.com/w/cpp/locale/codecvt/%7Ecodecvt

Destructs a std::codecvt facet. This destructor is protected and virtual (due to base class destructor being virtual). An object of type std::codecvt, like most facets, can only be destroyed when the last std::locale object that implements this facet goes out of scope or if a user-defined class is derived from std::codecvt and implements a public destructor.

This means you can't use std::codecvt_byname directly as the type held by a std::unique_ptr (or by a std::wstring_convert). But, as mentioned above, you can derive a new class from std::codecvt_byname and give it a public destructor. This is even demonstrated in the std::wstring_convert documentation on cppreference.com:

https://en.cppreference.com/w/cpp/locale/wstring_convert/wstring_convert

#include <locale>
#include <utility>
#include <codecvt>
 
// utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
template<class Facet>
struct deletable_facet : Facet
{
    using Facet::Facet; // inherit constructors
    ~deletable_facet() {}
};
 
int main()
{
    // UTF-16le / UCS4 conversion
    std::wstring_convert<
         std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>
    > u16to32;
 
    // UTF-8 / wide string conversion with custom messages
    std::wstring_convert<std::codecvt_utf8<wchar_t>> u8towide("Error!", L"Error!");

    // GB18030 / wide string conversion facet
    typedef deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>> F;
    std::wstring_convert<F> gbtowide(new F("zh_CN.gb18030"));
}

https://en.cppreference.com/w/cpp/locale/wstring_convert/%7Ewstring_convert

#include <locale>
#include <utility>
#include <codecvt>
 
// utility wrapper to adapt locale-bound facets for wstring/wbuffer convert
template<class Facet>
struct deletable_facet : Facet
{
    template<class ...Args>
    deletable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
    ~deletable_facet() {}
};
 
int main()
{
    // GB18030 / UCS4 conversion, using locale-based facet directly
    // typedef std::codecvt_byname<char32_t, char, std::mbstate_t> gbfacet_t;
    // Compiler error: "calling a protected destructor of codecvt_byname<> in ~wstring_convert"
    // std::wstring_convert<gbfacet_t> gbto32(new gbfacet_t("zh_CN.gb18030"));

    // GB18030 / UCS4 conversion facet using a facet with public destructor
    typedef deletable_facet<std::codecvt_byname<char32_t, char, std::mbstate_t>> gbfacet_t;
    std::wstring_convert<gbfacet_t> gbto32(new gbfacet_t("zh_CN.gb18030"));
} // destructor called here

Note the use of deletable_facet<std::codecvt_byname<...>> in both examples.

Also, note that std::wstring_convert takes ownership of the conversion facet that you give it, so you cannot use std::unique_ptr to manage its lifetime.
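
To make the ownership point concrete, here is a minimal sketch in which the facet counts its own destructions. `counting_facet` and `facets_destroyed_by_converter` are hypothetical names used only for illustration.

```cpp
#include <codecvt>
#include <cwchar>
#include <locale>

// Hypothetical facet with a public destructor that records when it runs.
struct counting_facet : std::codecvt<wchar_t, char, std::mbstate_t> {
    static inline int destroyed = 0;
    ~counting_facet() { ++destroyed; }
};

// Returns how many facets were destroyed while the converter went out of scope.
int facets_destroyed_by_converter() {
    const int before = counting_facet::destroyed;
    {
        // The converter receives a raw pointer and deletes it in its destructor.
        std::wstring_convert<counting_facet> conv(new counting_facet);
    }
    return counting_facet::destroyed - before;
}
```

The function returns 1: the converter deleted the facet itself, so a std::unique_ptr holding the same pointer would delete it a second time.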

Thus, in your example, use this instead:

// Convert wide string to Latin1 string
using latin1_cvt = deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
std::wstring_convert<latin1_cvt> latin1_conv(new latin1_cvt("iso-8859-1"));
std::string latin1_str = latin1_conv.to_bytes(wide_str);
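
Put together as a self-contained sketch, this might look as follows. `narrow_with_locale` is a hypothetical helper; it assumes you want a failed conversion reported as an empty optional rather than as an exception. The name passed to std::codecvt_byname is a locale name, so it has to be one your system actually provides, otherwise the constructor throws std::runtime_error.

```cpp
#include <codecvt>
#include <cwchar>
#include <locale>
#include <optional>
#include <stdexcept>
#include <string>

// Wrapper that gives a locale-bound facet a public destructor.
template <class Facet>
struct deletable_facet : Facet {
    using Facet::Facet;  // inherit constructors
    ~deletable_facet() {}
};

using byname_cvt =
    deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;

// Hypothetical helper: convert a wide string to the narrow encoding of the
// named locale. Returns std::nullopt if the locale name is unknown
// (std::runtime_error) or the conversion fails (std::range_error, which
// derives from std::runtime_error).
std::optional<std::string> narrow_with_locale(const std::wstring& wide,
                                              const char* locale_name) {
    try {
        std::wstring_convert<byname_cvt> conv(new byname_cvt(locale_name));
        return conv.to_bytes(wide);
    } catch (const std::runtime_error&) {
        return std::nullopt;
    }
}
```

Calling `narrow_with_locale(wide_str, name)` with a name taken from `locale -a` is a quick way to check whether the locale name is the problem before looking at the conversion itself.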

Thank you for the explanation. It makes sense from a technical point of view, i.e. the pure mechanics of solving the issue with the protected destructor. I still wonder about the rationale for forcing this in the standard library/committee - is it just to alert/notify about the deprecation? — juwalter, Apr 01 '23 at 10:04
btw- can compile now, but gives `std::range_error what(): string_convert::from_byte` - (I am on Linux) and found valid locale names with `locale -a` and tried multiple, including `de_DE.utf8` and `de_DE.iso88591` - same error. I suspect `std::wstring wide_str = utf8_conv.from_bytes(my_string);` is wrong? — juwalter, Apr 01 '23 at 10:12
@juwalter "*is it just to alert/notify about the deprecation?*" - no, they were always set up to work this way. Since C++14, there is a `[[deprecated]]` attribute to warn users about deprecations. "*can compile now, but gives `std::range_error`*" - that should be posted as a separate question. — Remy Lebeau, Apr 01 '23 at 17:05
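
The std::range_error mentioned in the comments is consistent with the input data: the bytes {0x68, 0xe4, 0x6c, 0x6c, 0x6f} are Latin-1, not UTF-8, and in UTF-8 a 0xe4 lead byte must be followed by continuation bytes, not by 0x6c. A minimal sketch to check this; `utf8_decode_throws` is a hypothetical helper name.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether decoding `bytes` as UTF-8 with
// std::codecvt_utf8 fails. The converter is built without an error string,
// so from_bytes signals failure by throwing std::range_error.
bool utf8_decode_throws(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    try {
        (void)conv.from_bytes(bytes);
        return false;
    } catch (const std::range_error&) {
        return true;
    }
}
```

Here `utf8_decode_throws("h\xe4llo")` is true, while the UTF-8 form `"h\xc3\xa4llo"` decodes without an exception. So the first step in the question's example, decoding the PLC bytes as UTF-8, is the one that fails.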
