
I have a Unicode string (a series of bytes) with an initial BOM (usually UTF-16 little-endian), and I need to convert it to an ASCII std::string.

I tried using this solution, but it didn't work in Visual Studio 2015.

How can I convert that series of bytes? The target system is Windows.

Dean
  • The answer you reference is not for converting to ASCII; it converts to UTF-8 (though handling a BOM requires a different codecvt facet and some additional configuration). It also should work just fine with VS2015; what problem did you encounter? – bames53 Nov 30 '15 at 17:38
  • Adding to what @bames53 said, any string that only contains ASCII characters can be converted to UTF-8 and will be compatible; if it contains characters outside of ASCII then you can't do a sensible conversion anyway. – Mark Ransom Dec 03 '15 at 15:37
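
To illustrate the point made in the comments above (only the ASCII range survives a conversion to ASCII), here is a minimal sketch that works on the raw bytes directly. It is not the approach from the linked answer. The function name `utf16BytesToAscii`, the `fallback` parameter, and the choice to treat BOM-less input as little-endian are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Sketch: turn a UTF-16 byte buffer (with an optional BOM) into an ASCII string.
// Assumption: input without a BOM is treated as little-endian.
// Every code unit above 0x7F is replaced by `fallback`, so a surrogate pair
// produces two fallback characters. A trailing odd byte is ignored.
std::string utf16BytesToAscii(const unsigned char* bytes, std::size_t size,
                              char fallback = '?')
{
    assert(bytes != nullptr || size == 0);

    bool littleEndian = true;
    std::size_t pos = 0;
    if (size >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            pos = 2;                 // UTF-16LE BOM: skip it
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            littleEndian = false;    // UTF-16BE BOM: skip it and swap byte order
            pos = 2;
        }
    }

    std::string out;
    out.reserve((size - pos) / 2);
    for (; pos + 1 < size; pos += 2) {
        // Assemble one 16-bit code unit in the detected byte order.
        const std::uint16_t unit = littleEndian
            ? static_cast<std::uint16_t>(bytes[pos] | (bytes[pos + 1] << 8))
            : static_cast<std::uint16_t>((bytes[pos] << 8) | bytes[pos + 1]);
        out.push_back(unit <= 0x7F ? static_cast<char>(unit) : fallback);
    }
    return out;
}
```

Whether replacing non-ASCII characters with `?` is acceptable depends on the data. If the text can contain anything outside ASCII, converting to UTF-8 as suggested above loses nothing.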

1 Answer


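This should work in Visual Studio. The function should never be inlined, because it allocates a temporary, variable-sized buffer on the stack: memory obtained from _alloca() is only released when the enclosing function returns, so an inlined copy inside a loop would keep growing the caller's stack.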

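#include <windows.h>  // WideCharToMultiByte, CP_ACP, UINT
#include <malloc.h>   // _alloca
#include <cwchar>     // std::wcslen
#include <string>

std::string toMultibyte(const wchar_t* src, UINT codepage = CP_ACP)
{
  int wcharCount = static_cast<int>(std::wcslen(src));
  // First call: query how many bytes the converted text needs.
  int buffSize = WideCharToMultiByte(codepage, 0, src, wcharCount, NULL, 0, NULL, NULL);
  // Stack allocation; released automatically when this function returns.
  char* buff = static_cast<char*>(_alloca(buffSize));
  // Second call: convert into the stack buffer.
  WideCharToMultiByte(codepage, 0, src, wcharCount, buff, buffSize, NULL, NULL);
  return std::string(buff, buffSize);
}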

If your compiler doesn't support _alloca(), or you have some prejudice against this function, you can use this approach instead.

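#include <cstddef>  // std::size_t
#include <memory>   // std::make_unique

template<std::size_t BUFF_SIZE = 0x100>
std::string toMultibyte(const wchar_t* src, UINT codepage = CP_ACP)
{
  int wcharCount = static_cast<int>(std::wcslen(src));
  // First call: query how many bytes the converted text needs.
  int buffSize = WideCharToMultiByte(codepage, 0, src, wcharCount, NULL, 0, NULL, NULL);
  // buffSize is never negative, so the cast avoids a signed/unsigned comparison.
  if (static_cast<std::size_t>(buffSize) <= BUFF_SIZE) {
    // Small result: convert into a fixed-size stack buffer.
    char buff[BUFF_SIZE];
    WideCharToMultiByte(codepage, 0, src, wcharCount, buff, buffSize, NULL, NULL);
    return std::string(buff, buffSize);
  } else {
    // Large result: fall back to a temporary heap buffer.
    auto buff = std::make_unique<char[]>(buffSize);
    WideCharToMultiByte(codepage, 0, src, wcharCount, buff.get(), buffSize, NULL, NULL);
    return std::string(buff.get(), buffSize);
  }
}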
Minor Threat
  • This code can be made more efficient by getting rid of the implicit character counting logic, and getting rid of `_alloca()` (which is not safe or portable, anyway). Pass the actual `wchar_t*` length to `WideCharToMultiByte()`, then allocate the `std::string` to the reported length and let the second `WideCharToMultiByte()` call fill it. No null-terminator handling is required. And don't forget to strip off the BOM before passing the data to `toMultibyte()`, or `WideCharToMultiByte()` will convert it and store the result in the `std::string` as well. – Remy Lebeau Dec 03 '15 at 02:25
  • @Remy Lebeau: AFAIU I have to count these wide chars anyway, whether implicitly inside `WideCharToMultiByte()` or explicitly via `wcsnlen()`. Portability is not an issue in code that requires `<windows.h>` for `WideCharToMultiByte()` and may break because of a warning promoted to an error for using `wcsnlen()` instead of the Microsoft-approved `wcsnlen_s()`. And I have some prejudice against modifying `std::string`'s contents via direct access to its internal buffer. It was prohibited in the C++98 standard, at least. – Minor Threat Dec 03 '15 at 14:16
  • @Remy Lebeau: PS, I've got the point: this code counts the wide chars _twice_. I'll update the post. – Minor Threat Dec 03 '15 at 14:36
  • I've overlooked the _third_ character counting inside `std::string`'s constructor. It is possible to get rid of it, too. – Minor Threat Dec 03 '15 at 15:39
  • @Dean: This function **should never** be inlined because it allocates a temporary, variable-sized buffer on the stack, though. – Minor Threat Dec 03 '15 at 16:02
  • @MinorThreat: by portability, I was referring to compiler portability rather than platform portability. Not all compilers support `_alloca()`, and some implement it differently. Besides, like I said, it can be a dangerous function anyway and should be avoided. Since a `std::string` has to be dynamically allocated anyway, better to just rely on heap memory instead of stack memory to receive the converted characters. – Remy Lebeau Dec 03 '15 at 17:44
  • @Remy Lebeau: the `std::string` instance being returned is eligible for simple RVO. Visual Studio does that even in Debug builds, I've checked. So I have a _single heap allocation_, performed in the char range constructor; that was my goal. It is possible to introduce `#ifdef _MSC_VER` and fall back to the `new[]` operator, if needed. – Minor Threat Dec 03 '15 at 19:09
  • If you don't want to use a local `std::string` variable so you can utilize RVO, at least use `std::vector` instead of `new char[]` to capture the output, and then copy that data to the `std::string` on the `return` statement. – Remy Lebeau Dec 03 '15 at 19:30
  • `std::unique_ptr` officially supports safe handling of `new[]` arrays, so this is not a case where `new[]` is inherently unsafe or where I have to use `std::vector`. My goal is a single heap allocation per UTF16->CP_ACP conversion, and I'll get at least two heap allocations if I use `std::vector` or any other wrapper around `::operator new[]`. – Minor Threat Dec 03 '15 at 19:47
  • @Remy Lebeau: Added no-alloca variant. – Minor Threat Dec 03 '15 at 20:07
  • Your no-alloc variant is performing 2 allocations when `buffSize > BUFF_SIZE`, and you are allocating the stack buffer even though it is not going to be used. Also, you can replace `buff + buffSize` with just `buffSize` when creating the `std::string`. – Remy Lebeau Dec 03 '15 at 20:18
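
The suggestion from the comments above (pass the real length, size the `std::string` from the first call, let the second call fill it, and skip the BOM first) might look roughly like the sketch below. The names `toMultibyteDirect` and `convertUnits` are hypothetical. The non-Windows branch is only an ASCII stand-in, added as an assumption so the sketch compiles off Windows. Writing through `&result[0]` relies on `std::string` storage being contiguous, which is guaranteed from C++11 onward.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
// Thin wrapper over the real API. With dst == nullptr and dstLen == 0 it
// returns the number of bytes the conversion needs.
static int convertUnits(const wchar_t* src, int srcLen, char* dst, int dstLen)
{
    return WideCharToMultiByte(CP_ACP, 0, src, srcLen, dst, dstLen, NULL, NULL);
}
#else
// Hypothetical stand-in with the same measure-then-fill contract:
// one byte per code unit, '?' for anything outside ASCII.
static int convertUnits(const wchar_t* src, int srcLen, char* dst, int dstLen)
{
    if (dst == nullptr) return srcLen;   // measuring pass
    if (dstLen < srcLen) return 0;       // buffer too small
    for (int i = 0; i < srcLen; ++i) {
        const bool ascii = static_cast<unsigned long>(src[i]) <= 0x7F;
        dst[i] = ascii ? static_cast<char>(src[i]) : '?';
    }
    return srcLen;
}
#endif

// Converts `length` wide characters with a single heap allocation
// (the one owned by the returned string) and no stack buffer.
std::string toMultibyteDirect(const wchar_t* src, std::size_t length)
{
    assert(src != nullptr || length == 0);

    // Skip a leading BOM (U+FEFF) so it is not converted as text.
    if (length > 0 && src[0] == static_cast<wchar_t>(0xFEFF)) {
        ++src;
        --length;
    }
    if (length == 0) return std::string();

    const int unitCount = static_cast<int>(length);
    // First call: measure only.
    const int byteCount = convertUnits(src, unitCount, nullptr, 0);
    if (byteCount <= 0) return std::string();   // conversion failed

    std::string result(static_cast<std::size_t>(byteCount), '\0');
    // Second call: fill the string's own buffer in place.
    convertUnits(src, unitCount, &result[0], byteCount);
    return result;
}
```

Because the caller supplies the length, the input does not need to be null-terminated, and the early return for an empty input avoids calling `WideCharToMultiByte()` with a zero character count, which it reports as an error.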