
You can read a file's contents into a char array using the following function:

void readFileContentsIntoCharArray(const char* filename, char* charArray, size_t sizeOfArray) {
    std::ifstream inputFileStream(filename, std::ios::binary);
    inputFileStream.read(charArray, sizeOfArray);
}

Now the file is written in UTF-16LE, so I want to read the file's contents into a char16_t array in order to process it more easily later on. I tried the following code.

void readUTF16FileContentsIntoChar16Array(const char* filename, char16_t* char16Array, size_t sizeOfArray) {
    std::ifstream inputFileStream(filename, std::ios::binary);
    inputFileStream.read(char16Array, sizeOfArray); // error: no matching overload
}

Of course, it didn't work: std::ifstream::read doesn't accept a char16_t*. I've been searching for a solution for a long time, but the only relevant one I've found so far is https://stackoverflow.com/a/10504278/1031769, which doesn't help because it uses wchar_t instead of char16_t.

How to make it work with char16_t?

Searene
  • Well you _can_ read `2*sizeOfArray` bytes then convert each 2 `char` to a `char16_t` manually..... if the architecture is big-endian you can't do better, if it's little-endian you can do some pointer-cast hack. – user202729 Sep 08 '18 at 07:05
  • "the file is written in UTF-16LE," Just transcode it before use with say `iconv`, and tell whoever has produced it to please stop right now, and use UTF-8. – n. m. could be an AI Sep 08 '18 at 15:11
  • "std::ifstream doesn't accept char16_t" You are supposed to cast your input array to `char*`. – n. m. could be an AI Sep 08 '18 at 15:23

2 Answers


I have created a sample UTF-16LE file and this code was able to read it correctly. You can give it a try:

// Requires: <fstream>, <string>, <locale>, <codecvt>
// Note: std::codecvt_utf16 and std::wstring_convert are deprecated since C++17.
std::string readUTF16(const char* filename) {
    std::wifstream file(filename, std::ios::binary);
    // Treat the stream as UTF-16LE; decoded units are delivered as wchar_t.
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

    std::wstring ws;
    for (wchar_t c; file.get(c); ) {
        ws += c;
    }
    // Re-encode the wide string as UTF-8 in a std::string.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.to_bytes(ws);
}
HugoTeixeira
  • It doesn't work on Linux/macOS with surrogate pairs: it truncates the surrogate pair and keeps only one half of it. – Searene Sep 10 '18 at 03:18

You could read the bytes into a char16_t array and then convert the endianness manually (different architectures store wide characters in different memory order).

To do that you have to be able to detect the endianness of the machine you are running on.

I use this check for the example, but you may want to use a proper library version with portable compile-time detection:

bool is_little_endian()
{
    char16_t const c = 0x0001;
    return *reinterpret_cast<char const*>(&c);
}

Then you could do this:

// Requires: <algorithm>, <cerrno>, <cstring>, <fstream>, <stdexcept>, <string>
std::u16string read_utf16le(std::string const& filename)
{
    // open at end so tellg() reports the file size.
    std::ifstream ifs(filename, std::ios::binary|std::ios::ate);

    if(!ifs)
        throw std::runtime_error(std::strerror(errno));

    auto end = ifs.tellg();
    ifs.seekg(0, std::ios::beg);
    auto size = std::size_t(end - ifs.tellg());

    if(size % 2)
        throw std::runtime_error("bad utf16 format (odd number of bytes)");

    std::u16string u16;
    u16.resize(size / 2);

    if(u16.empty())
        throw std::runtime_error("empty file");

    if(!ifs.read(reinterpret_cast<char*>(&u16[0]), size))
        throw std::runtime_error("error reading file");

    if(!is_little_endian())
    {
        // convert from big endian (swap bytes)
        std::transform(std::begin(u16), std::end(u16), std::begin(u16), [](char16_t c){
            auto p = reinterpret_cast<char*>(&c);
            std::swap(p[0], p[1]);
            return c;
        });
    }

    return u16;
}
Galik