Converting a unicode (with BOM) string to ASCII std::string

Question

I have a Unicode string (a series of bytes) with an initial BOM (it is usually UTF-16 little-endian), and I need to convert it to an ASCII std::string.

I tried using this solution, but it didn't work in Visual Studio 2015.

How can I convert that series of bytes? The target system is Windows.

The answer you reference does not convert to ASCII; it converts to UTF-8 (though handling a BOM requires a different codecvt facet and some additional configuration). It should also work just fine with VS2015; what problem did you encounter? – bames53, Nov 30 '15 at 17:38
Adding to what @bames53 said, any string that contains only ASCII characters can be converted to UTF-8 and will be compatible; if it contains characters outside ASCII, you can't do a sensible conversion anyway. – Mark Ransom, Dec 03 '15 at 15:37
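
As a portable illustration of the points in these comments, here is a minimal sketch that reads the BOM from the raw bytes, decodes the UTF-16 code units, and keeps only the ASCII ones. The function name `utf16BytesToAscii` and the `replacement` parameter are hypothetical (they do not come from the thread), and treating BOM-less input as little-endian is an assumption based on the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the thread): converts a UTF-16 byte buffer,
// with an optional BOM, to an ASCII std::string. Every code unit above 0x7F
// is replaced by `replacement`, so a surrogate pair yields two replacements.
std::string utf16BytesToAscii(const unsigned char* data, std::size_t size,
                              char replacement = '?')
{
  bool bigEndian = false;  // assumption: little-endian when no BOM is present
  std::size_t pos = 0;
  if (size >= 2) {
    if (data[0] == 0xFF && data[1] == 0xFE) {
      pos = 2;             // UTF-16LE BOM
    } else if (data[0] == 0xFE && data[1] == 0xFF) {
      pos = 2;             // UTF-16BE BOM
      bigEndian = true;
    }
  }

  std::string out;
  out.reserve((size - pos) / 2);
  // Read two bytes per code unit; a trailing odd byte is ignored.
  for (; pos + 1 < size; pos += 2) {
    const std::uint16_t unit = bigEndian
      ? static_cast<std::uint16_t>((data[pos] << 8) | data[pos + 1])
      : static_cast<std::uint16_t>(data[pos] | (data[pos + 1] << 8));
    out.push_back(unit <= 0x7F ? static_cast<char>(unit) : replacement);
  }
  return out;
}
```

Whether to substitute a placeholder, drop the character, or report an error for non-ASCII input depends on the application; the substitution here is only one possible choice.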

Minor Threat · Accepted Answer · 2015-12-03T22:46:47.933

3

This should work in Visual Studio. The function should never be inlined, because it allocates a temporary variable-sized buffer on the stack.

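#include <windows.h>  // WideCharToMultiByte, CP_ACP
#include <malloc.h>   // _alloca
#include <cwchar>     // std::wcslen
#include <string>

std::string toMultibyte(const wchar_t* src, UINT codepage = CP_ACP)
{
  int wcharCount = static_cast<int>(std::wcslen(src));
  // Query the required output size in bytes; 0 means failure or empty input.
  int buffSize = WideCharToMultiByte(codepage, 0, src, wcharCount, NULL, 0, NULL, NULL);
  if (buffSize <= 0) return std::string();
  // Stack buffer: very long inputs can overflow the stack.
  char* buff = static_cast<char*>(_alloca(buffSize));
  WideCharToMultiByte(codepage, 0, src, wcharCount, buff, buffSize, NULL, NULL);
  return std::string(buff, buffSize);
}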

If your compiler doesn't support _alloca(), or you have some prejudice against this function, you can use this approach instead.

#include <memory>  // std::make_unique

template<std::size_t BUFF_SIZE = 0x100>
std::string toMultibyte(const wchar_t* src, UINT codepage = CP_ACP)
{
  int wcharCount = static_cast<int>(std::wcslen(src));
  // Query the required output size in bytes; 0 means failure or empty input.
  int buffSize = WideCharToMultiByte(codepage, 0, src, wcharCount, NULL, 0, NULL, NULL);
  if (buffSize <= 0) return std::string();
  if (static_cast<std::size_t>(buffSize) <= BUFF_SIZE) {
    // Small result: convert into a fixed-size stack buffer.
    char buff[BUFF_SIZE];
    WideCharToMultiByte(codepage, 0, src, wcharCount, buff, buffSize, NULL, NULL);
    return std::string(buff, buffSize);
  } else {
    // Large result: fall back to a temporary heap buffer.
    auto buff = std::make_unique<char[]>(buffSize);
    WideCharToMultiByte(codepage, 0, src, wcharCount, buff.get(), buffSize, NULL, NULL);
    return std::string(buff.get(), buffSize);
  }
}

edited Dec 03 '15 at 22:46

answered Nov 30 '15 at 23:33


This code can be made more efficient by getting rid of the implicit character counting logic, and getting rid of `_alloca()` (which is not safe or portable, anyway). Pass the actual `wchar_t*` length to `WideCharToMultiByte()`, then allocate the `std::string` to the reported length and let the second `WideCharToMultiByte()` call fill it. No null-terminator handling is required. And don't forget to strip off the BOM before passing the data to `toMultibyte()`, or `WideCharToMultiByte()` will convert it and store the result in the `std::string` as well. – Remy Lebeau Dec 03 '15 at 02:25
@Remy Lebeau: AFAIU I have to count these wide chars anyway, whether implicitly inside `WideCharToMultiByte()` or explicitly via `wcsnlen()`. Portability is not an issue in code which requires `` for `WideCharToMultiByte()` and which will possibly break due to a warning promoted to an error because of `wcsnlen()` usage instead of the Microsoft-approved `wcsnlen_s()`. And I have some prejudice against modifying `std::string`'s contents via direct access to its internal buffer. It was prohibited in the C++98 standard, at least. – Minor Threat Dec 03 '15 at 14:16
@Remy Lebeau: PS, I've got the point: this code counts the wide chars _twice_. I'll update the post. – Minor Threat Dec 03 '15 at 14:36
I've overlooked the _third_ character counting inside `std::string`'s constructor. It is possible to get rid of it, too. – Minor Threat Dec 03 '15 at 15:39

@Dean: This function **should never** be inlined, though, because it allocates a temporary variable-sized buffer on the stack. – Minor Threat Dec 03 '15 at 16:02
@MinorThreat: by portability, I was referring to compiler portability rather than platform portability. Not all compilers support `_alloca()`, and some implement it differently. Besides, like I said, it can be a dangerous function anyway and should be avoided. Since a `std::string` has to be dynamically allocated anyway, better to just rely on heap memory instead of stack memory to receive the converted characters. – Remy Lebeau Dec 03 '15 at 17:44
@Remy Lebeau: the `std::string` instance being returned is eligible for a simple RVO optimisation. Visual Studio does that even in Debug builds; I've checked. So I have a _single heap allocation_, performed in the char range constructor, which was my goal. It is possible to introduce `#ifdef _MSC_VER` and fall back to the `new[]` operator, if needed. – Minor Threat Dec 03 '15 at 19:09
If you don't want to use a local `std::string` variable so you can utilize RVO, at least use `std::vector` instead of `new char[]` to capture the output, and then copy that data to the `std::string` on the `return` statement. – Remy Lebeau Dec 03 '15 at 19:30
`std::unique_ptr` officially supports safe handling of `new[]` arrays, so this is not a case where `new[]` is inherently unsafe or where I have to use `std::vector`. My goal is a single heap allocation per UTF16->CP_ACP conversion, and I'll get at least two heap allocations if I use `std::vector` or any other wrapper around `::operator new[]`. – Minor Threat Dec 03 '15 at 19:47
@Remy Lebeau: Added no-alloca variant. – Minor Threat Dec 03 '15 at 20:07
Your no-alloca variant performs 2 allocations when `buffSize > BUFF_SIZE`, and you are allocating the stack buffer even though it is not going to be used. Also, you can replace `buff + buffSize` with just `buffSize` when creating the `std::string`. – Remy Lebeau Dec 03 '15 at 20:18
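
Remy Lebeau's note about stripping the BOM before calling `toMultibyte()` could look roughly like the sketch below. The helper name `skipBom` is hypothetical (it is not part of the answer), and the sketch assumes the wide string is already in native byte order, so the BOM appears as the single code unit U+FEFF.

```cpp
// Hypothetical helper (not from the thread): returns a pointer just past a
// leading U+FEFF byte order mark, or `src` unchanged if there is none.
// A byte-swapped BOM (0xFFFE) is not handled here.
const wchar_t* skipBom(const wchar_t* src)
{
  return (src != nullptr && src[0] == L'\xFEFF') ? src + 1 : src;
}
```

With this in place, the answer's function would be called as `toMultibyte(skipBom(text))`, so the BOM never reaches `WideCharToMultiByte()`.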
