Is there no possible loss of data when converting from wstring to string using string constructor?

Question

When I do the following, my compiler warns me of a possible loss of data (but the compilation is successful):

std::vector<wchar_t> v1;
v1.push_back(L'a');
std::vector<char> v2(v1.begin(), v1.end());

When I do the following I get no such warning, and as far as I can tell I have not lost data when I've done it in the past:

std::wstring w1;
w1 = L"a";
std::string s1(w1.begin(), w1.end());

Is there in fact no possible loss of data in the second snippet? And if not, why not? Is there something in the basic_string constructor that handles iterators over a different character type? Or is there something special about the iterators themselves?

score 1 · Answer 1 · answered Nov 21 '13 at 16:10

Yes, the second snippet will lose data (truncate the character values) in the same way the first snippet will. Your library implementation is probably doing something that suppresses the warning message. It's impossible to know without looking at the source for your particular library implementation.
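As a rough way to check this claim, the sketch below runs the same wide string through both iterator-range constructors and compares the results. The function names are made up for illustration, and the exact value produced for an out-of-range character depends on the implementation; the point is only that both routes narrow each element in the same way.

```cpp
#include <string>
#include <vector>

// Narrows the way the question's first snippet does:
// through std::vector<char>'s iterator-range constructor.
std::vector<char> narrow_via_vector(const std::wstring& w) {
    std::vector<wchar_t> wide(w.begin(), w.end());
    return std::vector<char>(wide.begin(), wide.end());
}

// Narrows the way the second snippet does:
// through std::string's iterator-range constructor.
std::string narrow_via_string(const std::wstring& w) {
    return std::string(w.begin(), w.end());
}

// True if both routes yield the same sequence of char values.
bool same_result(const std::wstring& w) {
    std::vector<char> v = narrow_via_vector(w);
    std::string s = narrow_via_string(w);
    return std::string(v.begin(), v.end()) == s;
}
```

Calling same_result(L"\u03C4") should return true on typical implementations: both containers end up holding one truncated char, whether or not the compiler warned about it.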

Mark Ransom

That's interesting to hear. I've seen that technique for conversion of wstring to string so often that I just assumed there was no potential data loss. – John Fitzpatrick Nov 21 '13 at 16:39
@JohnFitzpatrick, most people only deal with characters in the range 0-255 so there isn't any loss. You'll only see it with characters above that range. – Mark Ransom Nov 21 '13 at 16:47
@MarkRansom strictly speaking, char is most likely signed, so its range is really -128 to +127, and there will be data loss if the narrow encoding doesn’t match the wide one within the 8-bit range anyway. – al45tair Nov 21 '13 at 18:06
@alastair the truncation will produce the same bit pattern whether the values are signed or unsigned. Duly noted about the encoding, but lots of people use ISO 8859-1 or a minor variant which match Unicode in the 128-255 range. – Mark Ransom Nov 21 '13 at 18:10
@MarkRansom Actually, if `char` is signed, the result is implementation defined. I don’t know of an implementation that does other than what you suggest, but it’s up to the compiler. Also, you can’t assume that `std::wstring` contains Unicode. It may very well not on machines in China and Japan. – al45tair Nov 22 '13 at 08:49
@alastair, I thought the integer conversions were more tightly specified than that but I'm not enough of a language lawyer to cite chapter and verse. And even in those locales where wide characters aren't Unicode, don't they still have the ASCII codes in common? – Mark Ransom Nov 22 '13 at 14:46
@MarkRansom See C99 6.3.1.3.3, which says “either the result is implementation-defined or an implementation-defined signal is raised”. Truncation and overflow are only defined behaviour for *unsigned* values in C. As for non-Unicode wide characters being supersets of ASCII, it’s very likely but not guaranteed; the main case I know of where that won’t be the case is mainframes (which might use 16-bit EBCDIC wchar_t values). – al45tair Nov 25 '13 at 13:56

score 1 · Accepted Answer · answered Nov 21 '13 at 16:40

To give a concrete example, if you write

std::wstring w1 = L"τ"; // That's a Unicode Greek Small Letter Tau (U+03C4)
std::string  s1(w1.begin(), w1.end());

then most likely you’ll end up with a string containing the single character 0xC4 (the low byte of 0x03C4), which is an “Ä” in both Windows ANSI and ISO Latin-1. That probably isn’t what you wanted, and while it will work OK on most platforms if you stick to ASCII, even that isn’t guaranteed (e.g. if your code runs on an IBM mainframe, you might find that narrow strings are EBCDIC and wide strings could be in any number of unusual encodings).
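If you do rely on the iterator-range constructor, one precaution is to check first that the wide string holds nothing outside the ASCII range. The helper below is only a sketch with a made-up name, and it assumes the wide and narrow encodings agree on values 0 to 0x7F, which (as noted above) is likely but not guaranteed.

```cpp
#include <string>

// True if every wide character has a value in 0..0x7F.
bool is_ascii_only(const std::wstring& w) {
    for (wchar_t c : w) {
        // Cast to an unsigned type so the check also rejects negative
        // values on platforms where wchar_t is signed.
        if (static_cast<unsigned long>(c) > 0x7Ful) {
            return false;
        }
    }
    return true;
}
```

With this, is_ascii_only(L"a") is true and is_ascii_only(L"\u03C4") is false, so the tau example would be caught before any truncation happens.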

If you want to convert wide strings to narrow strings, you need to use appropriate functions to cope with the fact that character encodings are involved. C++ doesn’t really provide a decent way to do this; typically you have to revert to C’s wcstombs() function, or use platform-specific APIs. (Someone might point you at the narrow member of the ctype facet, but that just means that any character that can’t be represented by a single byte gets replaced with a specified character; that isn’t really converting. Also, C++11 has some support for converting between Unicode strings using wstring_convert, but that only copes with Unicode and not everyone is using that for both narrow and wide characters.)
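A minimal sketch of the wcstombs() route might look like the following. The function name to_narrow is hypothetical, and the output encoding is whatever the current C locale (LC_CTYPE) specifies, so the caller is assumed to have selected a suitable locale beforehand, for example with std::setlocale(LC_ALL, ""). Conversion stops at the first embedded null character, because wcstombs() works on null-terminated strings.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Converts a wide string to a multibyte string in the encoding of the
// current C locale. Throws if some character cannot be represented.
std::string to_narrow(const std::wstring& w) {
    // Each wide character needs at most MB_CUR_MAX bytes in the current
    // locale; the extra byte leaves room for the terminating null.
    std::string out(w.size() * MB_CUR_MAX + 1, '\0');

    // wcstombs returns the number of bytes written, not counting the
    // terminator, or (size_t)-1 if a character has no representation.
    std::size_t written = std::wcstombs(&out[0], w.c_str(), out.size());
    if (written == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("wide string not representable in current locale");
    }

    out.resize(written);
    return out;
}
```

Unlike the iterator-range constructor, this reports an error instead of silently truncating when a character has no narrow representation in the active locale.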
