fstream read returns additional characters

Question

Say we have this example file.xml

<?xml version="1.0" encoding="utf-8"?>
<root>
    <something>yes</something>
    <something2>no</something2>
</root>

Now if we try to read it using this piece of code:

#include <fstream>
using namespace std;

int main() {
    fstream file;
    file.open("test.xml", fstream::in);  // text mode: no fstream::binary flag
    const int GIGA_SIZE = 65535;
    char buffer[GIGA_SIZE + 1] = {};     // zero-initialized
    file.read(buffer, GIGA_SIZE);
}

What we get in the buffer is this:

<?xml version="1.0" encoding="utf-8"?>
<root>
    <something>yes</something>
    <something2>no</something2>
</root>
ot>

Where do these additional characters come from? The documentation states that when istream::read reaches EOF, it stores only the characters extracted up to that point. The buffer is zero-initialized, and I even added the line buffer[111] = '\0', where 110 is the number of characters in the file. The problem still occurs. Interestingly, when we change the code to this:

#include <fstream>
using namespace std;

int main() {
    fstream file;
    file.open("test.xml", fstream::in);
    const int GIGA_SIZE = 65535;
    char buffer[GIGA_SIZE + 1] = {};

    int i = 0;
    while (!file.eof()) {
        file.read(&buffer[i], 1);  // one character per call
        ++i;
    }
}

Then the file is read properly, without the additional "ot>". I'm using C++17 with Visual Studio 2017.

No need to use char arrays for this, especially non-standard variable size ones. [How do I read an entire file into a std::string in C++?](https://stackoverflow.com/q/116038/260313) — rturrado, Jan 19 '22 at 20:45
Looks like you will bump heads with [Why is iostream::eof inside a loop condition (i.e. `while (!stream.eof())`) considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons) in your new version. — user4581301, Jan 19 '22 at 20:54
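To illustrate the loop-condition point above, here is a minimal sketch in which the read itself is the loop condition, so the body only runs when a character was actually extracted. The function name `read_char_by_char` is hypothetical and not from the question; binary mode is an assumption taken from the suggestion elsewhere in this thread.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: collects a file one character at a time.
// The loop stops as soon as get() fails, so no extra iteration
// happens after the last character.
std::string read_char_by_char(const std::string& path) {
    std::ifstream file(path, std::ios::in | std::ios::binary);
    std::string contents;
    char c;
    while (file.get(c)) {  // extraction result drives the loop
        contents.push_back(c);
    }
    return contents;
}
```

Compared with `while (!file.eof())`, this form never increments a counter for a read that extracted nothing.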
Open the file in binary mode. What you are seeing is probably line-ending transformations: `file.open("in.txt", fstream::in|fstream::binary);` — user4581301, Jan 19 '22 at 21:04
@user4581301 you are right! Consider answering the question. Btw, what does "line-ending transformation" mean, and why does it happen in text mode? — JAJA, Jan 19 '22 at 21:21
What does `file.gcount()` return? Also, please post the _binary_ contents of the buffer, not just the text. There may be a Ctrl+Z (eof) character, the file might actually be longer than the text, the possibilities are endless. And someone already mentioned line-ending transformations. — Hajo Kirchhoff, Jan 19 '22 at 21:21
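One possible way to gather what this comment asks for is sketched below. It repeats the text-mode read from the question, records `gcount()`, and hex-dumps the buffer up to its last non-zero byte, so any bytes sitting beyond the reported count become visible. `ReadReport` and `inspect_read` are hypothetical names, and trimming at the last non-zero byte assumes the file does not itself end in NUL bytes.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical result type for this sketch.
struct ReadReport {
    std::streamsize reported;  // value of gcount() after read()
    std::string hex;           // buffer bytes up to the last non-zero one
};

// Reads up to max_size characters in text mode (as the question does)
// and reports both gcount() and the raw buffer contents as hex.
ReadReport inspect_read(const std::string& path, std::size_t max_size) {
    ReadReport report{0, std::string()};
    std::string buffer(max_size, '\0');  // zero-filled, like the question's array
    std::ifstream file(path);            // no binary flag: text mode
    if (!file || max_size == 0) {
        return report;
    }
    file.read(&buffer[0], static_cast<std::streamsize>(max_size));
    report.reported = file.gcount();

    // Anything after the last non-zero byte is untouched zero fill.
    const std::size_t last = buffer.find_last_not_of('\0');
    const std::size_t used = (last == std::string::npos) ? 0 : last + 1;
    char cell[4];
    for (std::size_t i = 0; i < used; ++i) {
        std::snprintf(cell, sizeof cell, "%02x ",
                      static_cast<unsigned>(static_cast<unsigned char>(buffer[i])));
        report.hex += cell;
    }
    return report;
}
```

If the dump contains more bytes than `reported`, the extra bytes were written into the buffer but are not part of what `read()` claims to have extracted.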
Text mode is allowed to modify characters to help you write simpler code. The most common is the carriage return + linefeed used in DOS and Windows-based operating systems. Writing code to handle the different line endings on different systems (especially when it's more than one character) is a waste of time, so the stream handles it for you, quietly substituting in `'\n'` when it finds the system's official line ending in the stream. In this case two characters become one character and your buffer contents get screwed up. — user4581301, Jan 19 '22 at 21:39
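The effect described above can be observed by comparing how many characters `read()` reports in each mode. This is only a sketch: `count_delivered` is a hypothetical helper, the 65535 buffer size is borrowed from the question, and whether the two counts differ depends on the platform and on the line endings stored in the file.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns the number of characters a single read()
// delivers when the file is opened with the given mode, or -1 if the
// file cannot be opened.
std::streamsize count_delivered(const std::string& path, std::ios::openmode mode) {
    std::ifstream file(path, mode);
    if (!file) {
        return -1;
    }
    std::string buffer(65535, '\0');  // same capacity as in the question
    file.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
    return file.gcount();
}
```

Calling `count_delivered("test.xml", std::ios::in)` and `count_delivered("test.xml", std::ios::in | std::ios::binary)` on Windows with a CRLF file would be expected to give a smaller number for text mode, one less per line ending. On systems whose native line ending is a single `'\n'`, both calls typically return the same value.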
I didn't pitch this as an answer because A) There's a duplicate out there somewhere and B) using a smarter reader like @rturrado suggested is much better. — user4581301, Jan 19 '22 at 21:40
I've come across this two-liner that looks pretty cool: `std::ifstream ifs{ file_path, std::fstream::in | std::fstream::binary }; std::vector buffer{ std::istreambuf_iterator{ifs}, {} };` Based on this other answer: https://stackoverflow.com/a/18816870/260313 — rturrado, Jan 19 '22 at 21:46
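A spelled-out variant of the same idea, as a sketch: it reads the whole file into a `std::string` through `std::istreambuf_iterator`, with explicit template arguments and parentheses so that it does not depend on how brace initialization and class template argument deduction interact. The name `read_whole_file` is hypothetical.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the raw bytes of the file, or an empty
// string if it cannot be opened. Binary mode keeps line endings as stored.
std::string read_whole_file(const std::string& path) {
    std::ifstream ifs(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(ifs),
                       std::istreambuf_iterator<char>());
}
```

Because the string grows to fit the file, there is no fixed-size buffer and no trailing region that could hold stale bytes.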
