I've been tinkering with reading text files encoded in Unicode, and for some reason I get a question mark at the beginning of the output.

Here's the code.

#include <iostream>

#include <Windows.h>
#include <fcntl.h>
#include <io.h>

int main(void)
{
    HANDLE hFile = CreateFile(L"dog.txt",
                              GENERIC_READ,
                              0, //dwShareMode is a DWORD, not a pointer
                              NULL,
                              OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL,
                              NULL);

    _setmode(_fileno(stdout), _O_U16TEXT); //Making sure the console will 
                                           //display the  wide characters 
                                           //correctly. See below for link

    LARGE_INTEGER li;
    GetFileSizeEx(hFile,&li); 

    WCHAR* pBuf = new WCHAR[li.QuadPart / sizeof(WCHAR) + 1]; //Allocating space for 
                                                              //the file plus the 
                                                              //terminating null.

    DWORD dwRead = 0;
    BOOL bFinishRead = FALSE;
    do
    {
        bFinishRead = ReadFile(hFile,pBuf,li.QuadPart,&dwRead,NULL);
    } while(!bFinishRead);

    pBuf[li.QuadPart / sizeof(WCHAR)] = 0; //Making sure the end of the output 
                                           //is null-terminated.

    std::wcout << pBuf << std::endl;

    std::cin.get();

    return 0;
}

dog.txt

One Two Three

Console output

?One Two Three

I already eliminated a lot of gibberish by null-terminating the buffer, but the ? at the beginning puzzles me.

As for the call

_setmode(_fileno(stdout), _O_U16TEXT);

see "Output unicode strings in Windows console app".

Note: My code is Windows-oriented and I intend to keep it that way if possible.

Thanks.

Root

1 Answer

It's probably a byte order mark (BOM). It's standard practice to insert a BOM at the beginning of a UTF-16 text file so that it can be read correctly on systems with a different endianness (where the two bytes of each 16-bit code unit are stored in the opposite order). You can strip it by checking whether the first wchar_t is the code point U+FEFF, i.e. has the value 0xFEFF, and skipping it if so.
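
The check described above could be sketched roughly as follows. This is only a minimal illustration, not a complete fix for the program in the question: SkipBom is a hypothetical helper name, and it assumes the buffer was read with the same byte order as the machine, so the BOM shows up as 0xFEFF.

```cpp
#include <cstddef>

// Hypothetical helper (not from the question's code): returns a pointer to
// the first character after a leading BOM, or the buffer itself if the
// buffer is empty or does not start with one.
// Assumption: the file's byte order matches the machine's, so the BOM is
// read as 0xFEFF.
const wchar_t* SkipBom(const wchar_t* text, std::size_t length)
{
    const wchar_t kBom = 0xFEFF; // U+FEFF, the byte order mark

    if (length > 0 && text[0] == kBom)
        return text + 1; // step over the BOM

    return text; // nothing to strip
}
```

In the question's program this would presumably be used when printing, e.g. std::wcout << SkipBom(pBuf, dwRead / sizeof(WCHAR)) << std::endl; the original pBuf pointer should still be the one passed to delete[].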

ecatmur