First of all, in C++11 you have char16_t and std::u16string, which represent a UTF-16 code unit and a UTF-16 encoded string respectively. Furthermore, you can use std::codecvt to convert back and forth between UTF-16 and other representations (UTF-8, the platform's wchar_t, etc.).
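As a rough sketch of such a conversion, using std::wstring_convert with std::codecvt_utf8_utf16 (both standard in C++11, though deprecated since C++17), a round trip between UTF-8 and UTF-16 could look like this:

```cpp
#include <codecvt>
#include <locale>
#include <string>

int main() {
    // C++11 facet for converting between UTF-8 bytes and UTF-16 code units.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = conv.from_bytes(u8"caf\u00e9"); // UTF-8 -> UTF-16
    std::string utf8 = conv.to_bytes(utf16);               // UTF-16 -> UTF-8
}
```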
Thus, you can use the data from the string. However, first you have to convert it from a vector of bytes to a sequence of char16_t: check that the number of bytes is even (each UTF-16 code unit is two bytes), and then copy or reinterpret_cast them as char16_t. However, before doing this you need to handle the possibly different endianness of the data and that of your platform.
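For illustration, assuming the raw data sits in a std::vector<unsigned char> (the names here are hypothetical), the conversion step might look like this; the endianness problem is dealt with in the next step:

```cpp
#include <cstring>
#include <stdexcept>
#include <vector>

// Reinterpret a raw byte buffer as a sequence of char16_t units.
// Each UTF-16 code unit is 2 bytes, so the byte count must be even.
std::vector<char16_t> to_units(const std::vector<unsigned char>& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd byte count: not valid UTF-16");

    std::vector<char16_t> units(bytes.size() / 2);
    // memcpy avoids the alignment pitfalls of reinterpret_cast
    // on the vector's data pointer.
    std::memcpy(units.data(), bytes.data(), bytes.size());
    return units;
}
```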
In the specific data sample you show, the first bytes are FE and FF, which are very likely the byte order mark (BOM), a character used precisely to distinguish the endianness of the stream. In short, U+FEFF may appear to the computer as the bytes (FE FF) or (FF FE). If your platform has the opposite endianness to the data stream, you will read that first character as U+FFFE, a slot deliberately left unassigned that should never actually appear, so you know you have to swap the bytes of the whole stream. Otherwise, if you read U+FEFF correctly, you just leave the stream as it is and proceed to interpret the bytes as char16_t.
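A minimal sketch of that BOM check (again with hypothetical names), operating on the char16_t sequence produced above:

```cpp
#include <vector>

// Detect the BOM in the first decoded unit and byte-swap the whole
// sequence if it indicates the opposite endianness to the platform's.
void fix_endianness(std::vector<char16_t>& units) {
    if (units.empty()) return;

    if (units[0] == 0xFFFE) {
        // BOM read "backwards": swap the bytes of every unit.
        for (char16_t& u : units)
            u = static_cast<char16_t>((u << 8) | (u >> 8));
    }
    // At this point units[0] == 0xFEFF if a BOM was present; anything
    // else means there is no BOM and we must guess (see below).
    if (units[0] == 0xFEFF)
        units.erase(units.begin()); // drop the BOM itself
}
```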
Note that this is only possible because the input stream happens to have this mark as its first character; otherwise you would have no way to know the byte order for sure absent external metadata marking the stream as UTF-16LE (little-endian) or UTF-16BE (big-endian). In some cases there is such metadata (e.g. because the Java language spec may say so), but in others the absence of a BOM leaves you with heuristics. For example, if you know the text is mainly English, there should be a lot of 00 bytes, and you can check whether they fall predominantly in the even or the odd positions... but this has a chance of failure: if the text turns out to be Chinese, for instance, there will not be many nulls at all.
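A rough sketch of that heuristic, under the assumption that the data is mostly ASCII-range text (the function name is made up):

```cpp
#include <cstddef>
#include <vector>

// Heuristic for BOM-less data expected to be mostly ASCII-range text:
// in UTF-16BE the high (zero) byte of each unit comes first, in
// UTF-16LE it comes second. Count where the 0x00 bytes fall.
// Returns true for a big-endian guess; unreliable for e.g. CJK text.
bool guess_big_endian(const std::vector<unsigned char>& bytes) {
    std::size_t zeros_even = 0, zeros_odd = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == 0x00)
            (i % 2 == 0 ? zeros_even : zeros_odd)++;
    }
    return zeros_even > zeros_odd; // high bytes first => big-endian
}
```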