I am writing some code to combine two .txt files containing test data captured for the same equipment, but taken on separate occasions. The data is stored in a .csv format.

EDIT: The files are saved as .txt (UTF-8 with BOM encoding), but their contents are formatted as CSV.

Before tackling the combining part, I was sorting out some issues with reading the files (I am relatively inexperienced with C++) when I noticed a mismatch of several thousand bytes between the file size reported by several methods and the number of bytes that could actually be read before reaching EOF. Does anyone know what may be causing this?

Methods used to check file size before reading in:

  1. Constructing a std::filesystem::directory_entry object for the file in question, then calling its .file_size() method. Returns 733435 bytes.
  2. Constructing an fstream object for the file, then running the following code:
#include <iostream>
#include <fstream>

int main() {
    std::fstream data_file(path_to_file, std::ios::in);
    int file_size = 0; // EDIT: Was in the wrong scope

    if (data_file.is_open()) {
        

        data_file.seekg(0, std::ios_base::end);
        file_size = data_file.tellg();
        data_file.seekg(0, std::ios_base::beg);
    }
    
    std::cout << file_size << std::endl; // --> 733435 bytes

}
  3. Checking the properties of the file in File Explorer. File size = 733435 bytes, size on disk = 737280 bytes.

Then when I read in the file as follows:

#include <iostream>
#include <fstream>

int main() {
    std::fstream data_file(path_to_file, std::ios::in);
    
    if (data_file.is_open()) {
        int file_size, chars_read;

        data_file.seekg(0, std::ios_base::end);
        file_size = data_file.tellg();
        data_file.seekg(0, std::ios_base::beg);

        std::cout << "File size: " << file_size << std::endl;
        // |--> "File size: 733435"
    
        char* buffer = new char[file_size];

        // This sets both the eofbit and failbit flags for the stream,
        // as expected when the stream runs out of characters
        // before n characters are read (istream::read(char* s, streamsize n)).
        data_file.read(buffer, file_size);

        // We can check the number of chars read in using istream::gcount()
        chars_read = data_file.gcount();

        std::cout << "Chars read: " << chars_read << std::endl;
        // |--> "Chars read: 716153"

        delete[] buffer;
        data_file.close();
    }

}

The mystery deepens somewhat when you look at the contents that are read in. The file is read in using three slightly different methods.

  1. Reading in the data line-by-line to a std::vector<std::string> directly from the filestream.
std::fstream stream(path_to_file, std::ios::in);
std::vector<std::string> v;
std::string s;

while (getline(stream, s, '\n')) {
    v.push_back(s);
}
  2. Reading in the data using fstream::read(...) as above, then converting it to lines using a stringstream object.
//... data read into char* buffer;
std::stringstream ss(buffer, std::ios::in);
std::vector<std::string> v2;
while (getline(ss, s, '\n')) {
    v2.push_back(s);
}

As far as I can tell, these should have the same contents. But...

std::cout << v.size() << std::endl;  //  --> 17283
std::cout << v2.size() << std::endl; // --> 17688

EDIT: The file itself has 17283 lines, the last of which is empty

In conclusion, a mismatch of just over 17000 bytes between the expected and measured file size, plus a mismatch between the number of lines produced by two different methods of processing, means that I have no idea what's going on.

Any suggestions are helpful, including more ways to test what's going on.

Nicol Bolas
  • I don't understand the first part of the question. 1) 733435 bytes 2) 733435 bytes 3) 733435 bytes. Nothing wrong with that, no? – 463035818_is_not_an_ai Oct 04 '21 at 15:31
  • My point with those three examples is that 3 separate methods measure the file size the same. Then, when the file is read in, the mismatch occurs. – Ben Andrews Oct 04 '21 at 15:36
  • Please post a [mcve] of how you read the file. How do you read into `buffer`? – 463035818_is_not_an_ai Oct 04 '21 at 15:36
  • 1
    You're giving us a lot of information ... but it really doesn't look like any of it is *HELPFUL* to determine where the problem is ... or even *IF* there's actually a problem. SUGGESTION: what happens when you visually inspect the files side-by-side (e.g. in notepad or vi)? Do you see any discrepencies/any "missing lines/missing data"? – paulsm4 Oct 04 '21 at 15:41
  • There's an example of using fstream::read(...) in the second code block, reading lines directly to a vector line-by-line in the third code block, and converting the original buffer into a vector in the fourth code block. I can collate these if you think it increases readability? I'm currently writing out a short script & example .txt file which should illustrate the problem – Ben Andrews Oct 04 '21 at 15:47
  • @BenAndrews: Answers go in the answer section, not in the question. – Nicol Bolas Oct 04 '21 at 17:31
  • 1
    "Size on disk" could actually mean the size your filesystem uses to store the file. In most cases this is a multiple of the _sector size_, often a power of two >= 512. The number 737280 happens to be a multiple of 16384. Could be coincidence... – Emmef Oct 04 '21 at 18:58

1 Answer

fstream opens the file in "text" mode by default. On many platforms this makes no difference, but on Windows, text mode automatically translates line endings: each \r\n stored on the filesystem is read as a single \n. The size reported by the filesystem counts both bytes, so a text-mode read returns one character fewer per line break than the reported size.
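The effect on the byte count can be illustrated without touching a file. This is a minimal sketch, assuming the file uses \r\n line endings; translate_crlf and bytes_lost_to_translation are hypothetical helper names that only mimic the conversion and are not library functions.

```cpp
#include <cstddef>
#include <string>

// Mimics what a Windows text-mode read does to line endings:
// every "\r\n" pair in the raw bytes comes back as a single '\n'.
std::string translate_crlf(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\r' && i + 1 < raw.size() && raw[i + 1] == '\n') {
            continue; // drop the '\r'; the '\n' is copied on the next iteration
        }
        out.push_back(raw[i]);
    }
    return out;
}

// Number of bytes that "disappear" when the raw bytes are read in text mode.
std::size_t bytes_lost_to_translation(const std::string& raw) {
    return raw.size() - translate_crlf(raw).size();
}
```

Applied to the numbers in the question, 733435 - 716153 = 17282 missing bytes, which matches one dropped \r for each of the 17282 line breaks in a 17283-line file whose last line is empty.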

See "Difference between opening a file in binary vs text" for more discussion. One of the answers there discusses the allowable use of seekg() and tellg() on text-mode streams.

An easy thing to try is opening the file in binary mode: OR the std::ios::binary flag with your std::ios::in flag, i.e. std::ios::in | std::ios::binary.
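A minimal sketch of that suggestion, assuming the goal is to load the whole file and split it into lines; read_all_binary and split_lines are hypothetical helper names. It also sizes the string from gcount() rather than handing a bare char* to std::stringstream: a char* buffer with no terminating null is read past the bytes that were actually filled, which may be what inflates the second line count in the question.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Reads the whole file as raw bytes; std::ios::binary disables newline translation.
// Returns an empty string if the file cannot be opened (a choice made for this sketch).
std::string read_all_binary(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) {
        return {};
    }
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    in.seekg(0, std::ios::beg);
    if (size <= 0) {
        return {};
    }
    std::string data(static_cast<std::size_t>(size), '\0');
    in.read(&data[0], static_cast<std::streamsize>(size));
    // Keep only the bytes that were actually read.
    data.resize(static_cast<std::size_t>(in.gcount()));
    return data;
}

// Splits on '\n' and strips the trailing '\r' that binary mode leaves in place.
std::vector<std::string> split_lines(const std::string& data) {
    std::istringstream ss(data);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(ss, line, '\n')) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        lines.push_back(line);
    }
    return lines;
}
```

In binary mode the number of bytes read should agree with the size reported by the filesystem, at the cost of having to handle the \r characters yourself.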

Peter
  • Thanks for this. I'd read about what Windows does with newline characters but it didn't occur to me that it would affect it this way. Cheers! – Ben Andrews Oct 04 '21 at 15:58