I have a CSV file that I have to read using only the fstream library. There are 8 columns, but I will use only the first three. The file contains 591,000 lines of data.

I tried to read it like this:

while (retailFile.good()) {
    if (i == 0) {                // skip the header row
        getline(retailFile, dummy);
        i++;
        continue;
    }
    getline(retailFile, invoiceNo, ';');
    getline(retailFile, stockCode, ';');
    getline(retailFile, desc, ';');
    getline(retailFile, dummy, ';');
    getline(retailFile, dummy, ';');
    getline(retailFile, dummy, ';');
    getline(retailFile, dummy, ';');
    getline(retailFile, dummy);
    i++;
}

I tried it like that. I wasn't too hopeful, and it was a complete disappointment.

How can I read it very fast? It's ridiculous to keep it in an empty variable. Can't we skip those columns?

OSentrk
  • You are correct. That is a bit wasteful. [Take inspiration from option two of this linked answer.](https://stackoverflow.com/a/7868998/4581301) You probably won't find much of an actual speed difference though. Reading files off a disk is usually more time consuming than doing anything with the file (unless you do a lot of stuff with the file's data). – user4581301 Dec 21 '19 at 19:57
  • Note: `while (retailFile.good())` shares a number of problems with [Why is iostream::eof inside a loop condition (i.e. `while (!stream.eof())`) considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons) (see the sketch just after these comments) – user4581301 Dec 21 '19 at 20:01
  • Addition to my first comment: To find the end of the line, you have to read through all of the columns in the line looking for the end of the line. This is unavoidable. – user4581301 Dec 21 '19 at 20:06
  • Did you try the optimized / release build? Debug builds can be much slower. I have seen cases in Visual Studio where the Debug build took 100 times as long as Release for a particular algorithm. – drescherjm Dec 21 '19 at 20:10
  • *It's ridiculous to keep it in an empty variable.* -- Please clarify what this is supposed to mean. A compiler's optimizer would more than likely just throw away the unused variable. – PaulMcKenzie Dec 21 '19 at 20:11
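A minimal sketch of the loop idiom the linked question recommends, assuming whole lines are read and parsed afterwards (the first answer below uses the same pattern):

std::string line;
while (std::getline(retailFile, line))
{
    // use `line` here: getline has already reported success
}
// By contrast, `while (retailFile.good())` tests the stream *before* the
// reads, so the loop body also runs once after the final read has failed.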

2 Answers


To find the end of a line, you have to read through all of the columns in that line while scanning for the line ending; that is unavoidable. You do not have to process the unwanted fields, though.

Taking inspiration from [option two of this linked answer](https://stackoverflow.com/a/7868998/4581301), I get something like:

// requires <fstream>, <sstream>, <string>, and <limits>

// discard the first line (the header) without looking at it.
if (retailFile.ignore(std::numeric_limits<std::streamsize>::max(), '\n'))
{ // ALWAYS test IO transactions to make sure they worked, even something as 
  // trivial as ignoring the input. 

    std::string line;
    while (std::getline(retailFile, line))
    { // read the whole line
        // wrap the line in a stream for easy parsing
        std::istringstream stream(line);
        if (std::getline(stream, invoiceNo, ';') &&
            std::getline(stream, stockCode, ';') &&
            std::getline(stream, desc, ';'))
        { // successfully read all three required columns
          // Do not use anything you read until after you know it is good. Not 
          // checking leads to bugs and malware.

          // strongly consider doing something with the variables here. The next loop 
          // iteration will write over them
            i++;
        }
        else
        {
            // failed to find all three columns. You should look into why and 
            // handle accordingly.
        }
    }
}
else
{
    // failed to ignore the line. You should look into why and handle accordingly.
}

You probably won't find much of an actual speed difference. Reading files off a disk is usually more time consuming than doing anything with the file, unless you do a lot of stuff with the file's data after reading it. There are potentially faster ways to split the line, but again, the difference is probably buried in the cost of reading the file in the first place.
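For illustration, here is a hypothetical find-based splitter (a sketch, not part of this answer; the function name `firstThreeFields` is invented, and `invoiceNo`, `stockCode`, and `desc` are the variables from the question):

// Sketch: pull the first three ';'-separated fields out of one CSV line
// without constructing an istringstream per line.
bool firstThreeFields(const std::string& line, std::string& invoiceNo,
                      std::string& stockCode, std::string& desc)
{
    const std::size_t a = line.find(';');
    if (a == std::string::npos) return false;      // fewer than 2 fields
    const std::size_t b = line.find(';', a + 1);
    if (b == std::string::npos) return false;      // fewer than 3 fields
    std::size_t c = line.find(';', b + 1);
    if (c == std::string::npos) c = line.size();   // third field is the last one
    invoiceNo.assign(line, 0, a);
    stockCode.assign(line, a + 1, b - a - 1);
    desc.assign(line, b + 1, c - b - 1);
    return true;
}

Reusing the output strings with assign lets them keep their grown capacity between lines, saving an allocation per field, but as noted above the gain is likely buried in the I/O cost.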

user4581301

The question is: What is fast?

In the demo below, I create a file with 591,000 lines. Its size is 74 MB.

Then I set a bigger input buffer for the std::ifstream, read all lines, parse them, and copy the first 3 fields of each line into the resulting vector. The rest I ignore.

To prevent the result from being optimized away, I show 50 lines of output.

VS2019, C++17, Release Mode, all optimizations on.

Result: ~2.7s for reading and parsing all lines on my machine. (I must admit that I have 4 SSDs in RAID 0 via PCIe)

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <regex>
#include <array>
#include <chrono>
#include <iterator>
#include <algorithm>   // std::copy_n

int main() {
    // Put whatever filename you want
    static const std::string fileName{ "r:\\big.txt" };

    // Start Time measurement
    auto start = std::chrono::system_clock::now();
#if 0
    // Write file with 591000 lines
    if (std::ofstream ofs(fileName); ofs) {
        for (size_t i = 0U; i < 591000U; ++i) {
            ofs << "invoiceNo_" << i << ";"
                << "stockCode_" << i << ";"
                << "description_" << i << ";"
                << "Field_4_" << i << ";"
                << "Field_5_" << i << ";"
                << "Field_6_" << i << ";"
                << "Field_7_" << i << ";"
                << "Field_8_" << i << "\n";
        }
    }
#endif
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    // How long did it take?
    std::cout << "Time for writing the file:       " << elapsed.count() << " ms\n";


    // We are just interested in 3 fields
    constexpr size_t NumberOfNeededFields = 3U;

    // We expect 591000 lines; reserve a little more
    constexpr size_t NumberOfExpectedLinesInFile = 600000U;

    // We will create a bigger input buffer for our stream
    constexpr size_t ifStreamBufferSize = 100000U;
    static char buffer[ifStreamBufferSize];

    // The delimiter for our CSV
    static const std::regex delimiter{ ";" };

    // Main working variables
    using Fields3 = std::array<std::string, NumberOfNeededFields>;

    static Fields3 fields3;
    static std::vector<Fields3> fields{};

    // Reserve space to avoid reallocation
    fields.reserve(NumberOfExpectedLinesInFile);

    // Start timer
    start = std::chrono::system_clock::now();

    // Open file and check, if it is open
    if (std::ifstream ifs(fileName); ifs) {
        // Set a bigger file buffer. (Note: whether pubsetbuf has any effect
        // on an already-open file is implementation-defined; some standard
        // libraries require calling it before opening.)
        ifs.rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);

        // Read all lines
        for (std::string line{}; std::getline(ifs, line); ) {
            // Parse string
            std::copy_n(std::sregex_token_iterator(line.begin(), line.end(), delimiter, -1), NumberOfNeededFields, fields3.begin());
            // Store resulting 3 fields
            fields.push_back(std::move(fields3));
        }
    }
    end = std::chrono::system_clock::now();
    elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Time for parsing the file:       " << elapsed.count() << " ms\n";

    // Show some result 
    for (size_t i = 0; i < fields.size(); i += (fields.size()/50)) {
        std::copy_n(fields[i].begin(), NumberOfNeededFields, std::ostream_iterator<std::string>(std::cout, " "));
        std::cout << "\n";
    }
    return 0;
}
A M