How to get C++ std::string from Little-Endian UTF-16 encoded bytes

Question

I have a 3rd party device that communicates with my Linux box over a proprietary communication protocol that isn't well documented. Some packets convey "strings" that, after reading this Joel On Software article, appear to be in UTF-16 Little-Endian encoding. In other words, what I have on my Linux box after receipt of such packets are things like

// The string "Out"
unsigned char data1[] = {0x4f, 0x00, 0x75, 0x00, 0x74, 0x00, 0x00, 0x00};

// The string "°F"
unsigned char data2[] = {0xb0, 0x00, 0x46, 0x00, 0x00, 0x00};

As I understand it, I cannot treat these as an std::wstring because on Linux a wchar_t is 4 bytes. I do, however, have one thing going for me in that my Linux box is also Little-Endian. So, I believe I need to use something like std::codecvt_utf8_utf16<char16_t>. However, even after reading the documentation, I cannot figure out how to actually go from an unsigned char[] to an std::string. Can someone please help?

http://www.cplusplus.com/reference/string/u16string/ – Hans Passant Nov 08 '19 at 14:05

Victor Gubin · Answer 1 · 2019-11-08T14:47:26.433

2

If you wish to use the std::codecvt facilities (std::wstring_convert and std::codecvt_utf8_utf16 are deprecated since C++17), you can wrap your UTF-16 text in a std::u16string and then convert it to UTF-8, if needed.

For example:

// simply cast the raw data for the constructor, since we know that the
// unsigned char array is really a byte buffer from the network API
std::u16string u16_str( reinterpret_cast<const char16_t*>(data2) );

// UTF-16/char16_t to UTF-8
std::string u8_conv = std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t>{}.to_bytes(u16_str);
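
A note on the cast above: it assumes the buffer is suitably aligned for char16_t, that the host is little-endian, and that the data ends with a 0x0000 terminator. A minimal sketch of a more defensive alternative, assuming a hypothetical helper named `utf16le_bytes_to_u16string` (not part of this answer or of any library), assembles each code unit from its two bytes instead:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds a std::u16string from UTF-16LE bytes.
// Each code unit is assembled from its low and high byte explicitly, so the
// result does not depend on host endianness or on the buffer's alignment.
// Stops at a 0x0000 terminator or after `size` bytes, whichever comes first.
std::u16string utf16le_bytes_to_u16string(const unsigned char* data, std::size_t size)
{
    std::u16string out;
    for (std::size_t i = 0; i + 1 < size; i += 2)
    {
        const char16_t unit = static_cast<char16_t>(data[i] | (data[i + 1] << 8));
        if (unit == 0)
        {
            break;  // NUL terminator, as in the packets shown in the question
        }
        out.push_back(unit);
    }
    return out;
}
```

With the arrays from the question, `utf16le_bytes_to_u16string(data2, sizeof data2)` should yield a two-unit string that can be handed to `to_bytes` in the same way as `u16_str`.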

edited Nov 08 '19 at 14:47

answered Nov 08 '19 at 14:16

Victor Gubin


Thank you for providing the answer! I did not know that std::codecvt is deprecated in C++17, which is, in fact, what I am using. What is the correct way to do the same operation in C++17? – Paul Grinberg Nov 08 '19 at 14:33

It seems there is no standard library functionality for this any longer :( So iconv (part of the C library on Unix) is probably a good choice. In [my library](https://github.com/incoder1/IO/blob/master/include/text.hpp) I wrapped all the iconv work in a `transcode` function. – Victor Gubin Nov 08 '19 at 14:39
@PaulGrinberg It's kind of an unfortunate situation, because there isn't any: https://stackoverflow.com/q/42946335/256138 – rubenvb Nov 08 '19 at 14:41
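
Since the comments point out that the standard library offers no non-deprecated replacement, one option besides iconv is to do the conversion by hand. The following is only a sketch under stated assumptions: the function names are hypothetical, the input is UTF-16LE that may end with a 0x0000 terminator, and unpaired surrogates are replaced with U+FFFD (a policy chosen for this example, not something the thread prescribes).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: appends one Unicode code point to `out` as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80)
    {
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical helper: converts UTF-16LE bytes to a UTF-8 std::string.
// Stops at a 0x0000 terminator or after `size` bytes, whichever comes first.
std::string utf16le_to_utf8(const unsigned char* data, std::size_t size)
{
    // Reads the little-endian code unit that starts at byte offset `pos`.
    auto read_unit = [&](std::size_t pos) {
        return static_cast<std::uint32_t>(data[pos] | (data[pos + 1] << 8));
    };

    std::string out;
    std::size_t i = 0;
    while (i + 1 < size)
    {
        std::uint32_t cp = read_unit(i);
        i += 2;
        if (cp == 0)
        {
            break;  // NUL terminator
        }
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < size)
        {
            // High surrogate: combine with the following low surrogate, if any.
            const std::uint32_t low = read_unit(i);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD;  // any surrogate still left here is unpaired
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Calling `utf16le_to_utf8(data1, sizeof data1)` with the array from the question should give the three-byte string "Out"; for `data2` the degree sign becomes the two bytes 0xC2 0xB0.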

score 1 · Accepted Answer · answered Nov 11 '19 at 21:45

For the sake of completeness, here's the simplest iconv-based conversion I came up with:

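#include <iconv.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

auto iconv_eng = ::iconv_open("UTF-8", "UTF-16LE");
if (reinterpret_cast<::iconv_t>(-1) == iconv_eng)
{
  std::cerr << "Unable to create ICONV engine: " << std::strerror(errno) << std::endl;
}
else
{
  // src            a char * to utf16 bytes (iconv advances it)
  // src_size       the number of bytes left to convert (iconv decrements it)
  // dest           a char * to the utf8 output buffer (iconv advances it)
  // dest_size      the number of bytes left in that buffer (iconv decrements it)
  char *dest_begin = dest;  // remember where the output starts
  if (static_cast<std::size_t>(-1) == ::iconv(iconv_eng, &src, &src_size, &dest, &dest_size))
  {
    std::cerr << "Unable to convert from UTF16: " << std::strerror(errno) << std::endl;
  }
  else
  {
    // dest now points one past the last utf8 byte written; if src_size covered
    // the 0x00 0x00 terminator, the result ends with a '\0' character too
    std::string utf8_str(dest_begin, dest);
  }
  ::iconv_close(iconv_eng);  // release the descriptor whether or not iconv succeeded
}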
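
The snippet above leaves `src`, `dest`, and both sizes to the caller. A minimal sketch of how it might be wrapped in a function follows; the name `utf16le_to_utf8_iconv`, the exception-based error handling, and the worst-case buffer sizing (3 UTF-8 bytes per UTF-16 code unit) are assumptions of this example rather than part of the answer.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <string>
#include <system_error>

// Hypothetical wrapper: converts `size` bytes of UTF-16LE to a UTF-8
// std::string using iconv. `size` should not include the trailing
// 0x00 0x00 terminator, otherwise the result ends with a '\0' character.
std::string utf16le_to_utf8_iconv(const unsigned char* data, std::size_t size)
{
    iconv_t cd = ::iconv_open("UTF-8", "UTF-16LE");
    if (cd == reinterpret_cast<iconv_t>(-1))
    {
        throw std::system_error(errno, std::generic_category(), "iconv_open");
    }

    // Worst case: one UTF-16 code unit (2 bytes) becomes 3 UTF-8 bytes.
    std::string out((size / 2) * 3, '\0');

    // iconv takes non-const char** arguments and advances both pointers.
    char* src = reinterpret_cast<char*>(const_cast<unsigned char*>(data));
    std::size_t src_left = size;
    char* dest = &out[0];
    std::size_t dest_left = out.size();

    const std::size_t rc = ::iconv(cd, &src, &src_left, &dest, &dest_left);
    const int saved_errno = errno;  // iconv_close may overwrite errno
    ::iconv_close(cd);              // always release the descriptor

    if (rc == static_cast<std::size_t>(-1))
    {
        throw std::system_error(saved_errno, std::generic_category(), "iconv");
    }

    out.resize(out.size() - dest_left);  // drop the unused tail of the buffer
    return out;
}
```

For the arrays in the question the call would look like `utf16le_to_utf8_iconv(data1, sizeof data1 - 2)`, where the `- 2` skips the terminator.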

answered Nov 11 '19 at 21:45

Paul Grinberg


You've forgotten to determine the size of the UTF-8 destination buffer. See `static std::size_t utf8_buff_size(const char16_t* ustr, std::size_t size) noexcept` in the header file. – Victor Gubin Nov 13 '19 at 13:16
@VictorGubin - I'm not sure I understand your comment. Can you please clarify? As a point of possible relevance, my utf16 bytes are NULL terminated - that is, the last two bytes are 0x00 0x00 – Paul Grinberg Nov 13 '19 at 14:37

How would you know the size of the destination UTF-8 buffer in bytes (`&dest, &dest_size`)? If it is too small, iconv will fail with a "no more room" error. You can simply multiply the UTF-16 string length by 2, but when the source string contains mostly Latin-1 (e.g. single-byte) characters you will waste a lot of memory (50%). So the best option is to calculate the destination buffer size. – Victor Gubin Nov 13 '19 at 15:10
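
To illustrate the sizing point from the comments: a UTF-16 code unit never needs more than 3 UTF-8 bytes, so 3 bytes per code unit is always enough, while an exact figure takes one pass over the input. The sketch below is a hypothetical helper written for this example; it is not the `utf8_buff_size` from the linked library, and it counts an unpaired surrogate as 3 bytes on the assumption that it would be replaced with U+FFFD.

```cpp
#include <cstddef>

// Hypothetical helper: returns the number of UTF-8 bytes needed to encode
// `count` UTF-16 code units.
std::size_t utf8_size_of_utf16(const char16_t* units, std::size_t count)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        const char16_t u = units[i];
        if (u < 0x80)
        {
            bytes += 1;  // ASCII
        }
        else if (u < 0x800)
        {
            bytes += 2;  // e.g. the degree sign U+00B0
        }
        else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < count &&
                 units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF)
        {
            bytes += 4;  // a surrogate pair encodes one 4-byte code point
            ++i;         // skip the low surrogate
        }
        else
        {
            bytes += 3;  // rest of the BMP, or an unpaired surrogate
        }
    }
    return bytes;
}
```

The returned value can be used to allocate the destination buffer before calling iconv, which trades an extra pass over the input for not over-allocating.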
