Can I use the _bstr_t class to convert between multibyte and Unicode on Windows? (C++)

Question

BSTR is this weird Windows data type with a few specific uses, such as COM functions. According to MSDN, it contains a WCHAR string and some other stuff, like a length descriptor. Windows is also nice enough to give us the _bstr_t class, which encapsulates BSTR; it takes care of the allocation and deallocation and gives you some extra functionality. It has four constructors, including one that takes in a char* and one that takes in a wchar_t*. MSDN's description of the former: "Constructs a _bstr_t object by calling SysAllocString to create a new BSTR object and then encapsulates it. This constructor first performs a multibyte to Unicode conversion."

It also has operators that can extract a pointer to the string as any of char*, const char*, and wchar_t*, and I'm pretty sure those are null-terminated, which is cool.

I've spent a while reading up on how to convert between multibyte and Unicode, and I've seen a lot of talk about how to use mbstowcs and wcstombs, and how MultiByteToWideChar and WideCharToMultiByte are better because encodings may differ, and blah blah blah. It all kind of seems like a headache, so I'm wondering whether I can just construct a _bstr_t and use the operators to access the strings, which would be... a lot fewer lines of code:

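#include <comutil.h>                       // _bstr_t; link with comsuppw.lib
const char* multi = "asdf";
_bstr_t bs(multi);                         // the char* constructor converts multibyte to Unicode
const wchar_t* wide = (const wchar_t*)bs;  // read-only pointer into bs's buffer, valid only while bs is alive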

I guess my intuitive answer to this is that we don't know what Windows is doing behind the scenes, so if I have a problem using mbstowcs/wcstombs (I guess I really mean mbstowcs_s/wcstombs_s) rather than MultiByteToWideChar/WideCharToMultiByte, I shouldn't risk it because it's possible that Windows uses those. (It's almost certainly not using the latter, since I'm not specifying a "code page" here, whatever that is.) Honestly I'm not sure yet whether I consider the mbstowcs_s and wcstombs_s functions OK for my purposes, because I don't really have a grasp on all of the different encodings and stuff, but that's a whole different question and it seems to be addressed all over the Internet.

Sooooo, is there anything wrong with doing this, aside from that potential concern?

Ugh, so many weird string types! Yeah, I had seen that type referenced on http://www.codeproject.com/Articles/4829/Guide-to-BSTR-and-C-String-Conversions; I didn't really pay attention because the name was so ugly I figured it'd be really complicated. Any particular reason it's better? — melanie johnson, Jan 14 '13 at 22:33
A lot of this string conversion misery will disappear completely when you join the 21st century. Unicode is universal today, 8-bit encodings that are not utf-8 are a historical footnote whose useful life ended a long time ago. Just use `const wchar_t* multi = L"asdf";` and never look back again. — Hans Passant, Jan 14 '13 at 22:41
No particular reason, just that the naming convention fits in better with ATL, that's all. :) Feel free to use _bstr_t instead. — user541686, Jan 14 '13 at 22:41
Regarding what @HansPassant mentioned -- be aware that it really depends on *how* Unicode-compliant you want to be. Windows isn't always fully compliant either; a lot of code in Windows assumes 16 bits is a Unicode character, which it isn't (UTF-16 is a variable-length encoding; 64k characters isn't exactly enough). But for many cases it should be fine. — user541686, Jan 14 '13 at 22:43
Most Windows code does not assume that 16 bits represent a single character. Microsoft dropped UCS-2 and switched to UTF-16 in Windows 2000, so a 16-bit value can either represent a single character by itself, or be a member of a surrogate pair that represents a single character. Either way, Windows stopped assuming 16 bits = 1 char a LONG time ago. — Remy Lebeau, Jan 17 '13 at 22:22

score 2 · Accepted Answer · answered Jan 14 '13 at 22:44

Using _bstr_t::_bstr_t(const char*) is not exactly a good idea in production code:

MSDN: "Constructs a _bstr_t object by calling SysAllocString to create a new BSTR object and encapsulate it. This constructor first performs a multibyte to Unicode conversion. If s2 is too large, you [sic] may generate a stack overflow error. In such a situation, convert your char* to a wchar_t with MultiByteToWideChar and then call the wchar_t * constructor."

Besides that, _bstr_t::operator wchar_t*() const throw() seems barely useful. It only hands back a pointer to the object's internal buffer, so you have to treat the result as const:

MSDN: "These operators can be used to extract raw pointers to the encapsulated Unicode or multibyte BSTR object. The operators return the pointer to the actual internal buffer, so the resulting string cannot be modified."

So _bstr_t is just a helper object for encapsulating BSTRs, and a mediocre one at that. Conversion using MultiByteToWideChar and WideCharToMultiByte is a much better choice, for multiple reasons:

It's much less prone to crash.
You don't get a const buffer in return, because you provide your own.
The names of those functions are self-descriptive. Conversion through a constructor and casting operator of an unrelated type is not.
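
For illustration, here is a minimal sketch of the usual two-call pattern with those functions: call once to get the required size, allocate, then call again to convert. The helper names `to_wide` and `to_narrow` are hypothetical, not part of any API. `CP_ACP` (the system ANSI code page) is an assumption; use `CP_UTF8` or another code page if your input is encoded differently. The `#else` branch falls back to mbstowcs/wcstombs only so that the sketch also compiles off Windows.

```cpp
#include <cstdlib>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: multibyte -> wide. Returns an empty string on failure.
std::wstring to_wide(const std::string& narrow) {
    if (narrow.empty()) return std::wstring();
#ifdef _WIN32
    const int len = static_cast<int>(narrow.size());
    // First call: ask how many wchar_t units the result needs.
    const int needed = MultiByteToWideChar(CP_ACP, 0, narrow.data(), len, nullptr, 0);
    if (needed <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    // Second call: convert into the buffer we own.
    MultiByteToWideChar(CP_ACP, 0, narrow.data(), len, &wide[0], needed);
    return wide;
#else
    // Fallback (assumption: current C locale, POSIX-style size query).
    const std::size_t needed = std::mbstowcs(nullptr, narrow.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) return std::wstring();
    std::vector<wchar_t> buffer(needed + 1);
    std::mbstowcs(buffer.data(), narrow.c_str(), buffer.size());
    return std::wstring(buffer.data(), needed);
#endif
}

// Hypothetical helper: wide -> multibyte. Returns an empty string on failure.
std::string to_narrow(const std::wstring& wide) {
    if (wide.empty()) return std::string();
#ifdef _WIN32
    const int len = static_cast<int>(wide.size());
    // First call: ask how many bytes the result needs.
    const int needed = WideCharToMultiByte(CP_ACP, 0, wide.data(), len,
                                           nullptr, 0, nullptr, nullptr);
    if (needed <= 0) return std::string();
    std::string narrow(static_cast<std::size_t>(needed), '\0');
    // Second call: convert into the buffer we own.
    WideCharToMultiByte(CP_ACP, 0, wide.data(), len,
                        &narrow[0], needed, nullptr, nullptr);
    return narrow;
#else
    // Fallback (assumption: current C locale, POSIX-style size query).
    const std::size_t needed = std::wcstombs(nullptr, wide.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) return std::string();
    std::vector<char> buffer(needed + 1);
    std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    return std::string(buffer.data(), needed);
#endif
}
```

Both helpers return a string you own and can modify, unlike the pointer you get from the _bstr_t operators, and the code page is spelled out at the call site instead of being hidden inside a constructor.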

The caveat concerning the stack overflow that `_bstr_t::_bstr_t(const char*)` may generate only applies to Visual Studio .Net 2003, not later versions — klaus triendl, Nov 28 '13 at 17:20
