
I am working on an application in which Java JCA AES is used to encrypt string values, which then get decrypted inside a C++ app. I am using the Crypto++ library for decryption and am able to recover the original bytes, but I am stuck at the last step, where I need to convert them back to the original UTF-16 encoded string. I have the bytes stored inside a std::vector. Content of the vector (in hex): {fe ff 00 49 00 6c 00 6f 00 76 00 65 00 6a 00 61 00 76 00 61}

How do I convert this to a UTF-16 string?

    Can you be a bit more precise about what you want to convert them to? Why is that vector not already a UTF16 string? Do you mean an instance of `std::wstring`? – David Schwartz Jun 30 '19 at 21:49
  • The data structure is a vector of bytes (std::vector<byte>), and the bytes stored inside represent UTF-16 encoded "Ilovejava" with a BOM (U+FEFF) in the first two positions. I need to convert this vector to the string value "Ilovejava". – sunil kumar Jun 30 '19 at 22:04
  • You need to tell us more about the target platform. UTF-16 is platform dependent because of endianness. – zett42 Jun 30 '19 at 22:15
  • It's going to be SLES 10. Is it possible to write a platform independent solution? – sunil kumar Jun 30 '19 at 22:21
  • @CarlosHeuberger. I need c++ version of bytes to string conversion. – sunil kumar Jun 30 '19 at 22:32
  • @sunilkumar The vector already contains the string value "ilovejava" in UTF16 encoding. Do you want it encoded some other way? If so, what way? Do you want it in some other type? If so, what type? – David Schwartz Jun 30 '19 at 22:44
  • @DavidSchwartz convert it to std::u16string – sunil kumar Jun 30 '19 at 23:24

1 Answer


First of all, since C++11 you have char16_t and std::u16string, which represent a UTF-16 code unit and a UTF-16 encoded string. Furthermore, you can use std::codecvt to convert back and forth between UTF-16 and other representations (UTF-8, the platform's version of wchar_t, etc.).

Thus, std::u16string is the natural target type for your data. However, you first have to convert the vector of bytes to a sequence of char16_t: check that the number of bytes is even, then copy (or reinterpret_cast) each pair of bytes as one char16_t. Before doing this, though, you need to handle the possibly different endianness of the data and of your platform.
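A minimal sketch of that copy step, assuming the byte order of the data already matches the platform (the function name is mine, not part of any library):

```cpp
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Reinterpret a byte vector as UTF-16 code units in the platform's
// native byte order. The byte count must be even.
std::u16string bytesToU16Native(const std::vector<unsigned char>& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("odd number of bytes");
    std::u16string out(bytes.size() / 2, u'\0');
    // memcpy avoids the alignment and strict-aliasing pitfalls of
    // reinterpret_cast on the raw byte buffer.
    std::memcpy(&out[0], bytes.data(), bytes.size());
    return out;
}
```

Using memcpy rather than reinterpret_cast sidesteps undefined behavior from misaligned or aliased access; the endianness question still has to be settled first, as described next.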

In the specific data sample you show, the first bytes are FE and FF, which very likely form the byte order mark, a character used precisely to distinguish the endianness of the stream. In short, U+FEFF may appear to the computer as the bytes (FE FF) or (FF FE). If your platform has the opposite endianness to the data stream, you will read that first character as U+FFFE, which is a code point deliberately left unassigned and should never actually appear, so you know you have to swap the bytes of the whole stream. Otherwise, if you read U+FEFF correctly, you just leave the stream as it is and proceed to interpret the bytes as char16_t.
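That BOM check can be sketched as a decoder that picks the byte order from the first two bytes and strips the BOM from the result (again, the function name is mine):

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Decode a UTF-16 byte stream that starts with a BOM into a
// std::u16string, choosing the byte order from the BOM and
// dropping the BOM itself from the result.
std::u16string decodeUtf16WithBom(const std::vector<unsigned char>& bytes) {
    if (bytes.size() < 2 || bytes.size() % 2 != 0)
        throw std::invalid_argument("not a valid UTF-16 stream");
    bool bigEndian;
    if (bytes[0] == 0xFE && bytes[1] == 0xFF)
        bigEndian = true;   // BOM read as FE FF: big-endian stream
    else if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        bigEndian = false;  // BOM read as FF FE: little-endian stream
    else
        throw std::invalid_argument("missing BOM");
    std::u16string out;
    out.reserve(bytes.size() / 2 - 1);
    for (std::size_t i = 2; i < bytes.size(); i += 2) {
        // Assemble each code unit explicitly, so the result is
        // correct regardless of the platform's own endianness.
        char16_t unit = bigEndian
            ? static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1])
            : static_cast<char16_t>((bytes[i + 1] << 8) | bytes[i]);
        out.push_back(unit);
    }
    return out;
}
```

Assembling each code unit from explicit shifts, instead of memcpy, is what makes this version platform independent, which is what the comments above asked about.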

Note that this is possible because the input stream specifically has this mark as its first character; otherwise you would have no way to know the byte order for sure, absent external metadata marking the stream as UTF-16LE (little-endian) or UTF-16BE (big-endian). In some cases there is such metadata (e.g. because the Java language spec may say so), but in others the absence of a BOM forces you to apply heuristics. For example, if you know the text is mainly English, there should be a lot of 00 bytes, and you can check whether they fall predominantly in the even or odd positions... but this has a chance of failure: maybe you are looking at Chinese text, where there are not so many nulls.
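The zero-byte heuristic for BOM-less streams can be sketched like this (a guess, not a proof, as noted above; the function name is mine):

```cpp
#include <cstddef>
#include <vector>

// Heuristic for BOM-less UTF-16: count zero bytes at even vs. odd
// offsets. Mostly-ASCII text encoded as UTF-16BE has its zero (high)
// bytes at even offsets; UTF-16LE has them at odd offsets. This can
// misfire on text with few ASCII characters.
bool guessBigEndian(const std::vector<unsigned char>& bytes) {
    std::size_t zerosEven = 0, zerosOdd = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == 0)
            (i % 2 == 0 ? ++zerosEven : ++zerosOdd);
    }
    return zerosEven >= zerosOdd;
}
```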

Javier Martín