
Say we have this example file, test.xml:

<?xml version="1.0" encoding="utf-8"?>
<root>
    <something>yes</something>
    <something2>no</something2>
</root>

Now if we try to read it using this piece of code:

#include <fstream>
using namespace std;

int main() {
    fstream file;
    file.open("test.xml", fstream::in);
    const int GIGA_SIZE = 65535;
    char buffer[GIGA_SIZE + 1] = {};  // zero-initialized
    file.read(buffer, GIGA_SIZE);     // one read for the whole file
}

What we get in the buffer is this:

<?xml version="1.0" encoding="utf-8"?>
<root>
    <something>yes</something>
    <something2>no</something2>
</root>
ot>

Where do those additional characters come from? The documentation states that when istream::read reaches end of file, it stores only the characters extracted up to that point. The buffer is zero-initialized, and I even added the line buffer[111] = '\0', where 110 is the number of characters in the file. The problem still occurs. Interestingly, when we change the code to this:

#include <fstream>
using namespace std;

int main() {
    fstream file;
    file.open("test.xml", fstream::in);
    const int GIGA_SIZE = 65535;
    char buffer[GIGA_SIZE + 1] = {};  // zero-initialized

    int i = 0;
    while (!file.eof()) {
        file.read(&buffer[i], 1);     // one character per read
        ++i;
    }
}

Then the file is read properly, without the additional "ot>". I'm using C++17 with Visual Studio 2017.
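One way to see what read actually stored is to ask the stream with gcount() instead of relying on the zero-filled buffer. The following is a minimal sketch, not a definitive fix: the helper name read_counted and its parameters are hypothetical, and it keeps the file in text mode as in the code above. It returns only the characters that read reported as stored, so anything left in the buffer beyond that count is ignored.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): reads at most max_chars
// characters from path and returns exactly the characters read() stored.
std::string read_counted(const std::string& path, std::size_t max_chars) {
    std::ifstream file(path);  // text mode, as in the question
    std::vector<char> buffer(max_chars);
    file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    // gcount() is the number of characters stored by the last unformatted read.
    const std::streamsize stored = file.gcount();
    return std::string(buffer.data(), static_cast<std::size_t>(stored));
}
```

If the file cannot be opened, read fails immediately, gcount() is 0, and the sketch returns an empty string.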

UpAndAdam
JAJA
  • No need to use char arrays for this, especially non-standard variable size ones. [How do I read an entire file into a std::string in C++?](https://stackoverflow.com/q/116038/260313) – rturrado Jan 19 '22 at 20:45
  • Looks like you will bump heads with [Why is iostream::eof inside a loop condition (i.e. `while (!stream.eof())`) considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons) in your new version. – user4581301 Jan 19 '22 at 20:54
  • Open the file in binary mode. What you are seeing is probably line-ending transformations: `file.open("in.txt", fstream::in|fstream::binary);` – user4581301 Jan 19 '22 at 21:04
  • @user4581301 you are right! Consider answering the question. By the way, what does "line-ending transformation" mean, and why does it happen in text mode? – JAJA Jan 19 '22 at 21:21
  • What does `file.gcount()` return? Also, please post the _binary_ contents of the buffer, not just the text. There may be a Ctrl+Z (eof) character, the file might actually be longer than the text, the possibilities are endless. And someone already mentioned line-ending transformations. – Hajo Kirchhoff Jan 19 '22 at 21:21
  • Text mode is allowed to modify characters to help you write simpler code. The most common is the carriage return + linefeed used in DOS and Windows-based operating systems. Writing code to handle the different line endings on different systems (especially when it's more than one character) is a waste of time, so the stream handles it for you, quietly substituting in `'\n'` when it finds the system's official line ending in the stream. In this case two characters become one character and your buffer contents get screwed up. – user4581301 Jan 19 '22 at 21:39
  • I didn't pitch this as an answer because A) There's a duplicate out there somewhere and B) using a smarter reader like @rturrado suggested is much better. – user4581301 Jan 19 '22 at 21:40
  • I've come across this two-liner that looks pretty cool: `std::ifstream ifs{ file_path, std::fstream::in | std::fstream::binary }; std::vector<char> buffer{ std::istreambuf_iterator<char>{ifs}, {} };` Based on this other answer: https://stackoverflow.com/a/18816870/260313 – rturrado Jan 19 '22 at 21:46
  • I understand. Many thanks, guys, for the clarification. – JAJA Jan 19 '22 at 21:54
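The two suggestions from the comments, opening the file in binary mode and reading it with stream-buffer iterators, can be combined into a short sketch. The helper name read_whole_file is hypothetical, and throwing on a failed open is an assumption made only for this illustration. Because the stream is opened with the binary flag, no line-ending translation takes place, so the returned string has the same bytes as the file on disk.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the original thread): returns the file's
// bytes unchanged as a std::string.
std::string read_whole_file(const std::string& path) {
    // The binary flag turns off the "\r\n" -> '\n' translation that text mode
    // performs on Windows.
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open " + path);
    }
    // Construct the string from the stream buffer's begin and end iterators.
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

With this approach there is no fixed-size buffer to terminate by hand, and the string's size() is the exact byte count of the file.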

0 Answers