
I am trying to read from a file using the Windows function ReadFile(), but when I print the message it prints too many characters.

It doesn't matter if I read from an ANSII file or UNICODE file, I don't get the right characters.

The text in the file is: "This is a text file".

Screenshot for the ANSII file: [image: screen shot for ANSII file ReadFile()]

Screenshot for the UNICODE file: [image: screen shot for the UNICODE file ReadFile()]

What am I doing wrong?

#include <windows.h>
#include <tchar.h>
#include <stdio.h>

#define BUFSIZE 4000


int _tmain(int argc, TCHAR *argv[])
{
    HANDLE  hIn;
    TCHAR buffer[BUFSIZE];
    DWORD nIn = 0;

    //create file
    hIn = CreateFile(argv[1],
        GENERIC_READ,
        FILE_SHARE_READ,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL);
    //check the handle
    if (hIn == INVALID_HANDLE_VALUE)
    {
        printf("\nOpen file error\n");
    }
    //read from file
    if (FALSE == ReadFile(hIn, buffer, BUFSIZE - 1, &nIn, NULL))
    {
        printf("Terminal failure: Unable to read from file.\n GetLastError=%08x\n", GetLastError());
        CloseHandle(hIn);
        return 0;
    }

    if (nIn > 0 && nIn <= BUFSIZE - 1)
    {
        buffer[nIn] = TEXT('\0'); // NULL character
        _tprintf(TEXT("Data read from %s (%d bytes): \n"), argv[1], nIn);
    }
    else if (nIn == 0)
    {
        _tprintf(TEXT("No data read from file %s\n"), argv[1]);
    }
    else
    {
        printf("\n ** Unexpected value for nIn ** \n");
    }
    printf("1:%s\n", buffer);
    _tprintf(TEXT("\n2:%s"), buffer);

    return 0;
}
Richard Chambers
Alex Balu

1 Answer


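The Windows API function ReadFile() reads raw bytes (unsigned char) and reports a byte count in nIn. It knows nothing about TCHAR, which is two bytes wide (wchar_t) when the program is compiled with UNICODE defined, as is the default in current Visual Studio projects, and one byte wide (char) otherwise. In a UNICODE build your code therefore fills only the first nIn / 2 elements of buffer but writes the terminator at buffer[nIn], so everything in between is uninitialized and gets printed as extra characters. So you need to make the following modifications.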

See also "What is the difference between _tmain() and main() in C++?", which has some additional information about the different compilation targets for Windows and the character encodings used.

First, your buffer should be an array of BYTE and not of TCHAR.

Second, make sure that it is zero filled, so initialize the buffer as in BYTE buffer[BUFSIZE] = {0};.

Since Windows UNICODE is UTF-16, with two bytes per code unit, the end of string character for a UNICODE text string is two bytes of binary zero, and you need to take this into account for your buffer length. When placing your end of string, make sure that it is two bytes of zero and not just one.

You should read at most BUFSIZE - 2 bytes, so that you read an even number of bytes in case it is a UNICODE string and still have room for the two-byte terminator. Your buffer size should also be a multiple of two, which 4000 is.
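
The points above can be sketched roughly as follows. This is only a minimal sketch: terminate_bytes and read_text_bytes are hypothetical helper names that are not part of the original code, the ReadFile part compiles only on Windows, and a UTF-16 file is assumed to contain an even number of bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

#define BUFSIZE 4000  /* even, so UTF-16 code units are never split */

/* Hypothetical helper: nread bytes of data are already stored in buf
   (capacity cap). Append two zero bytes so the data is terminated
   whether it is later treated as char text or as wchar_t text.
   Returns the number of data bytes. */
size_t terminate_bytes(unsigned char *buf, size_t cap, size_t nread)
{
    assert(cap >= 2 && nread <= cap - 2);  /* room for the terminator */
    buf[nread] = 0;
    buf[nread + 1] = 0;
    return nread;
}

#ifdef _WIN32
/* Hypothetical wrapper: read at most BUFSIZE - 2 bytes from an open
   handle into a BYTE buffer and terminate them.
   Returns the byte count, or (size_t)-1 if ReadFile fails. */
size_t read_text_bytes(HANDLE h, BYTE buf[BUFSIZE])
{
    DWORD nIn = 0;  /* ReadFile reports bytes, not characters */
    if (!ReadFile(h, buf, BUFSIZE - 2, &nIn, NULL))
        return (size_t)-1;
    return terminate_bytes(buf, BUFSIZE, nIn);
}
#endif
```

With this approach the zero fill is optional, because the terminator is written at the byte offset that ReadFile actually reported.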

If the file contains ANSI text, it will probably look like garbage when displayed as UNICODE, because each UNICODE character will be composed of two ANSI characters.
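
To illustrate, here is a small sketch that pairs up bytes the way a UTF-16 reader would. utf16le_unit is a hypothetical helper, and little-endian byte order is assumed because that is what Windows uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine two consecutive bytes into one UTF-16
   code unit, low byte first (little-endian). index counts code units,
   nbytes is the number of bytes available. */
uint16_t utf16le_unit(const unsigned char *bytes, size_t index, size_t nbytes)
{
    assert(2 * index + 1 < nbytes);  /* both bytes must exist */
    return (uint16_t)(bytes[2 * index] | (bytes[2 * index + 1] << 8));
}
```

For the ANSI bytes of "This", 'T' (0x54) and 'h' (0x68) merge into the single code unit 0x6854, and 'i' (0x69) and 's' (0x73) merge into 0x7369. Both values fall in the CJK ideograph range, which is why the output does not look like the original text.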

So to make the strings the same, you will need to translate between the two character encodings. See this article about using Byte Order Marks in text files to indicate the kind of character encoding being used in the file.
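
A check for a Byte Order Mark at the start of the buffer might look like the sketch below. The enum and the function name detect_bom are made up for illustration, and the sketch ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ENC_UNKNOWN, ENC_UTF8, ENC_UTF16_LE, ENC_UTF16_BE } bom_encoding;

/* Hypothetical helper: inspect the first bytes of a buffer for a BOM.
   ENC_UNKNOWN means no BOM was found, which is the case for ANSI files
   and for UTF files written without a BOM. */
bom_encoding detect_bom(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return ENC_UTF16_LE;
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN;
}
```

When no BOM is present the encoding still has to come from somewhere else, such as the user or a file naming convention. Once it is known, ANSI or UTF-8 bytes can be converted to UTF-16 with MultiByteToWideChar() before printing them with the wide character functions.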

Richard Chambers
  • Oof. It doesn't actually have to be *zero-filled*, it just needs to be *zero-terminated*. Your way works, but my premature optimization compulsion is tingling. It would be perfectly sufficient to append a NUL character to the end of the buffer after you get the actual buffer length when `ReadFile` returns. Also, I know you're trying to keep things simple here, and the truth is heinously complicated and depressing, but it's basically impossible to determine what character encoding a file uses. There are some heuristics, but they're unreliable. You need to be told either by metadata or the user. – Cody Gray - on strike Apr 22 '17 at 13:19
  • @CodyGray so I know it does not need to be zero-filled but on the other hand starting from a known state helps with debugging and viewing data structures in a debugger. And after the first read, it is no longer zero filled anyway. I know that trying to determine character encoding by inspecting a byte stream is basically impossible. That is why `Content-Encoding:` and `Content-Type:` exists in http protocol. However since it is his file then he can use byte order marks or file extension or any other method he wants to use. – Richard Chambers Apr 22 '17 at 13:47
  • *"I know that trying to determine character encoding by inspecting a byte stream is basically impossible. That is why `Content-Encoding:` and `Content-Type:` exists in http protocol."* - Uhm, no. That's why BOMs exist. Lacking that, you can run some heuristics against the buffer, e.g. calling [IsTextUnicode](https://msdn.microsoft.com/en-us/library/windows/desktop/dd318672.aspx). Besides, there is no encoding called *"ANSII"*, and `TCHAR` is **not** a Unicode code unit (`wchar_t` is). – IInspectable Apr 22 '17 at 17:39
  • @IInspectable BOMs only apply to UTF encodings, not ANSI encodings. There are a lot more ANSI encodings than there are UTFs. Being told the file encoding explicitly is better than guessing. And `IsTextUnicode()` is guessing (and is known to guess wrong at times). And in UNICODE compilations, `TCHAR` is `wchar_t`. – Remy Lebeau Apr 23 '17 at 05:07
  • TCHAR is not two bytes wide in modern Unicode windows. Its size is determined by conditionals. It can be one or two bytes wide. – David Heffernan Apr 23 '17 at 06:38
  • @RemyLebeau: *"Every fish is a goldfish (as long as you only consider goldfish)."* - If you follow that school of thought, then yes, every `TCHAR` is a `wchar_t`. – IInspectable Apr 24 '17 at 11:20
  • Looks like the latest versions of Visual Studio defaults to UNICODE enabled if you do a Create New Project. You have to change the Properties to get Multi-byte. I am curious why someone would target some version of Windows prior to Windows XP and not compile C/C++ with UNICODE enabled for anything other than a special target? Seems that Microsoft has done an end of life on everything previous to Windows 7 except for a couple of specialty Window builds such as POS Ready 2009 (Windows XP). Certainly 16 bit Windows 95/98/ME are no longer supported. – Richard Chambers Apr 24 '17 at 15:43
  • @DavidHeffernan "*TCHAR is not two bytes wide in modern Unicode windows.*" - Yes, it is. As I said, "*In **UNICODE** compilations*", meaning the `UNICODE` (and `_UNICODE`) conditional is defined, then `TCHAR` (and `_TCHAR`) maps to `wchar_t`, which is 2 bytes on Windows. MSDN says so. You know this, why are you debating it? Are you thinking of the `_MBCS` and `DBCS` conditionals? Because those are legacy, nobody really uses them anymore, and I wasn't referring to them anyway, I specifically said `UNICODE`. – Remy Lebeau Apr 24 '17 at 15:44
  • @IInspectable I specifically said "*in **UNICODE** compilations*", so in that situation, yes, every `TCHAR` is a `wchar_t`, by definition. – Remy Lebeau Apr 24 '17 at 15:47
  • @RemyLebeau My comment was addressed to the answer, specifically its first sentence. Your comment was of course accurate. – David Heffernan Apr 24 '17 at 15:47
  • @RichardChambers: Windows 1/2/3 were 16 bit. Windows 9x are 32-bit. And Visual Studio 2017 is at least the third release that defaults to Unicode project settings. This is nothing new. MFC's MBCS build is no longer part of a standard installation since VS 2012; it's a separate download. It's just unnerving, that people **still** use `TCHAR`, and those that should know better do not take offense. – IInspectable Apr 24 '17 at 16:44