
I have a million entries in a CSV file and need to load them, but it takes around 2 minutes to finish loading. I need to make the data load faster. Is there a better way to do this, so I can research it and try to fix it? Thank you for any help.

CSVReader.h

#pragma once
#include "OrderBookEntry.h"
#include <vector>
#include <string>

class CSVReader
{
public:
    CSVReader();
    static std::vector<OrderBookEntry> readCSV(std::string csvFile);
    // declarations needed for the definitions in CSVReader.cpp
    static std::vector<std::string> tokenise(std::string csvLine, char separator);
    static OrderBookEntry stringsToOBE(std::vector<std::string> tokens);
};

CSVReader.cpp

#include "CSVReader.h"
#include <iostream>
#include <fstream>

CSVReader::CSVReader() {

}

std::vector<OrderBookEntry> CSVReader::readCSV(std::string csvFilename) {
    std::vector<OrderBookEntry> entries;
    std::ifstream csvFile{csvFilename};
    std::string line;
    if (csvFile.is_open())
    {
        while (std::getline(csvFile, line))
        {
            try {
                OrderBookEntry obe = stringsToOBE(tokenise(line, ','));
                entries.push_back(obe);
            }
            catch (const std::exception& e) {
                std::cout << "CSVReader::readCSV bad data" << std::endl;
            }
        } // end of while
    }
    std::cout << "Successfully read " << entries.size() << " entries" << std::endl;
    return entries;
}

std::vector<std::string> CSVReader::tokenise(std::string csvLine, char separator) {
   std::vector<std::string>tokens;
   signed int start, end;
   std::string token;
   start = csvLine.find_first_not_of(separator, 0);
   do {
     end = csvLine.find_first_of(separator, start);
     if (start == csvLine.length() || start == end) break;
     if (end >= 0) token = csvLine.substr(start, end - start);
     else token = csvLine.substr(start, csvLine.length() - start);
     tokens.push_back(token);
     start = end + 1;
   } while (end > 0);

   return tokens;
}

OrderBookEntry CSVReader::stringsToOBE(std::vector<std::string>tokens) {
    double price, amount;
     if (tokens.size() != 5) {
        std::cout << "Bad Input" << std::endl;
        throw std::exception{};
    }
    try {
        //we have 5 tokens
        price = std::stod(tokens[3]);
        amount = std::stod(tokens[4]);
    }
    catch(const std::exception& e){
        std::cout << "Bad Float!" << tokens[3] << std::endl;
        std::cout << "Bad Float!" << tokens[4] << std::endl;
        throw;
    }

    OrderBookEntry
    obe{price,amount,tokens[0],tokens[1],OrderBookEntry::stringToOrderBookType(tokens[2])};
    return obe;

 }
  • Do you need all of the entries loaded in main memory at once? "A million entries" sounds like you should be using a database, not a spreadsheet. – JohnFilleau Dec 29 '21 at 05:44
  • Learn how to pass variables as references to functions instead of copying them when they are large. `stringsToOBE` should be `stringsToOBE(const std::vector<std::string>& tokens)`. – JohnFilleau Dec 29 '21 at 05:47
  • 1. Rethink your program's architecture. 2. Pass variables that are expensive to copy as const references. – JohnFilleau Dec 29 '21 at 05:49
  • 1. What you wrote is not ideal but on a modern machine it should be faster. Maybe you can provide the test file. 2. Be sure you test your program in Release configuration. Differences between Debug and Release configurations can be dramatic. 3. You did not understand `throw`/`catch`. 4. `strtod` does not throw. – zdf Dec 29 '21 at 06:22
  • If you know the exact size of the file and how large the std::vector will become, the most efficient way to my understanding is to use reserve on your std::vector. So if you have 1,000,000 lines for 'entries', do entries.reserve(1000000); to preallocate the memory. Based on my experience this is faster than using std::vector.push_back(); since push_back() has to add memory 1,000,000 times instead of doing it all at once with reserve(). You will also need to have a dummy counter so that you can index through the vector each time you read a line and add data to it. – Zachariah Rabatah Dec 29 '21 at 06:34 (a sketch combining this with the other suggestions follows these comments)
  • @ZachariahRabatah `reserve` changes the capacity of a `vector` but not the actual size. You would still use `push_back` or `emplace_back` to add items to the vector. The `reserve` would prevent reallocation until the extra capacity is exhausted. – Retired Ninja Dec 29 '21 at 06:54
  • If it were me, I would as a test remove all the code except the `getline` loop (without processing the line’s contents at all), and see how long it takes. This will give you a benchmark, as your file access code is pretty much boilerplate. – DS_London Dec 29 '21 at 07:37
  • Also, this question gives some alternative methods of tokenising strings: https://stackoverflow.com/questions/236129/how-do-i-iterate-over-the-words-of-a-string – DS_London Dec 29 '21 at 07:45
  • @JohnFilleau and SQLite makes this easier than ever to do. By the way, databases also can do all operations such as "average price", "cheapest", "when and what" and so much more all by themselves in SQL, without wasting RAM or IO time on reading all data into memory beforehand just to probably throw it away after doing one simple thing that ignored over 99% of the data read. Not loading data you don't need is how you load data you need faster. It's baffling that people don't consider a database right away for almost any problem. Maybe when it becomes standard for a filesystem to act like one. –  Dec 29 '21 at 07:52
  • @Sahsahae I don’t find it baffling. I can pass a CSV file to almost any platform: if it were a database file then I would need to ensure that the recipient has the correct database drivers to be able to decode the file. Using a database adds another layer of complexity and dependencies. In many instances this extra layer is well worth the trouble, but not always. – DS_London Dec 29 '21 at 08:07
  • @zdf `stod` does throw if it can't parse the string at all, it doesn't throw however if it can't parse the whole string – Alan Birtles Dec 29 '21 at 08:13
  • @DS_London including one single [sqlite3.c/.h](https://www.sqlite.org/index.html) pair of files into your existing codebase and learning how to use less than 20 functions which you won't be calling directly once you setup your data API for that project is a definition of trivial and reinventing a database on a crappy CSV reader that you yourself wrote is way more complex than learning how to write 4 lines of SQL. [This has been proven by the industry](https://www.sqlite.org/famous.html), works, is simple and elegant. –  Dec 29 '21 at 08:16
  • @AlanBirtles Yes, I misread. `strotd` vs `stod`. – zdf Dec 29 '21 at 08:16
  • @Sahsahae maybe I was unclear. I was replying to this bald generalisation: “It's baffling that people don't consider a database right away for almost any problem”. – DS_London Dec 29 '21 at 08:51
  • @DS_London every program works with some sort of data and curiously it's always something completely dumb and unnecessary when a database would suffice. You use filesystem directly when you know for a fact that using a database will be a hindrance. If you actually tried it, you'd realize that it almost never is, which is why only specific solutions like `git` reinvent the wheel, not end-user software, which just uses a database, in times when it's done correctly. Even the browser you're writing this nonsense on uses its own database, when "bare html file copies suffice for cache". –  Dec 29 '21 at 10:16
  • @Sahsahae I am not disagreeing with you, though in this case it seems likely that the OP has been given this data in CSV format, and that the data contains errors (possibly a spreadsheet saved as CSV). If I were doing this, I’d have a process that just read each line of the file, tokenized it, and then saved away the fields of valid records into … tada … a database. Then write a second process that worked on clean accessible data. It may be heretical but then I wouldn’t mind much if the screening process was inefficient as it only runs once per file. – DS_London Dec 29 '21 at 11:42
  • @DS_London it takes as long as disk IO does plus few minutes to convert any CSV into any database table... And it's even reversible! The last thing I'd do is reinvent a CSV parser, but it's okay if you get paid by the line of code you write... –  Dec 29 '21 at 11:48
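
To make those comments concrete, here is a small sketch combining the cheapest suggestions above: reserve the vector up front, pass the filename by const reference, and time the loop so the effect of each change can be measured. The reserve figure of 1,000,000 is taken from the question's description of the file; everything else reuses the question's own tokenise and stringsToOBE (the declaration in CSVReader.h has to change to match the new signature).

// Sketch only: the question's readCSV with reserve() and timing added.
// Needs #include <chrono> in addition to the question's includes.
std::vector<OrderBookEntry> CSVReader::readCSV(const std::string& csvFilename) {
    std::vector<OrderBookEntry> entries;
    entries.reserve(1000000);  // figure taken from the question; avoids repeated reallocation

    const auto start = std::chrono::steady_clock::now();

    std::ifstream csvFile{csvFilename};
    std::string line;
    while (std::getline(csvFile, line)) {
        // To benchmark raw file I/O alone, as suggested above, comment out this try block.
        try {
            entries.push_back(stringsToOBE(tokenise(line, ',')));
        } catch (const std::exception&) {
            std::cout << "CSVReader::readCSV bad data" << std::endl;
        }
    }

    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    std::cout << "Read " << entries.size() << " entries in " << elapsed.count() << " ms" << std::endl;
    return entries;
}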

2 Answers


As mentioned in the comments, the first thing that should be optimized is the passing of arguments by value. I.e., the parameters should be const std::string& arg_name instead of std::string arg_name, and so on.

Then, the use of std::vector is not optimal either. This is because of the vector's nature: it stores its elements contiguously, which means that if at some point there is no room to append the next element contiguously, all existing elements have to be reallocated to a new place. So, if a vector is not critically needed, consider substituting it with std::list or something else.

Then, I'd simplify the CSVReader::tokenise method to something like this:

std::list<std::string> CSVReader::tokenise(const std::string& str, char sep)
{
  std::list<std::string> result;
  std::string::size_type cur = 0;
  for(auto pos = str.find_first_of(sep); pos != std::string::npos; cur = pos + 1, pos = str.find_first_of(sep, cur))
    result.emplace_back(str, cur, pos - cur);
  result.emplace_back(str, cur);

  return result;
}

Then, I'd combine the tokenise and stringsToOBE methods.
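
For example, something along these lines (a rough sketch only; parseLine is an illustrative name that would need a declaration in CSVReader.h, and the constructor call copies the one in the question's stringsToOBE):

// Sketch: tokenises and converts in one pass, with no intermediate container of tokens.
// Needs #include <array> and #include <stdexcept> in addition to the existing includes.
OrderBookEntry CSVReader::parseLine(const std::string& line)
{
    std::array<std::string, 5> fields;
    std::string::size_type cur = 0;
    for (std::size_t i = 0; i < 5; ++i)
    {
        // The last field runs to the end of the line; the others end at the next comma.
        auto pos = (i == 4) ? line.size() : line.find(',', cur);
        if (pos == std::string::npos)
            throw std::runtime_error{"bad line"};  // fewer than 5 fields
        fields[i] = line.substr(cur, pos - cur);
        cur = pos + 1;
    }

    return OrderBookEntry{std::stod(fields[3]), std::stod(fields[4]),
                          fields[0], fields[1],
                          OrderBookEntry::stringToOrderBookType(fields[2])};
}

The per-line work is then a single pass over the characters plus the two std::stod calls.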

In general, review the code which is executed in the loop.

Serge Roussak

Think about what sort of operations you are doing while parsing a single line, and where the inefficiency comes from.

For instance,

std::vector<std::string> CSVReader::tokenise(std::string csvLine, char separator)

It accepts csvLine by copy. Copying a string is a potentially slow operation as it might involve memory allocation.

Then you create a vector of strings from a string... wouldn't it be much better to create a vector of std::string_view, so you don't copy each element of the string? To make the function safer, have tokenise accept a std::string_view as well, rather than std::string or const std::string&.

Next, looking at CSVReader::stringsToOBE(std::vector<std::string> tokens), I see that you actually require the size to be 5, then parse some elements as floating point, copy some as strings, and convert another to an enum.

Wouldn't it make more sense to require the input to be std::array<std::string_view, 5>, to enforce the size of 5 at compile time rather than test it? To adapt to this change you can make tokenise a template that accepts a number n as a template parameter and converts a std::string_view into std::array<std::string_view, n> based on the separator, verifying at the end that the whole string view was consumed.

This saves another dynamic allocation by eliminating the unnecessary std::vector.
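
A sketch of that tokenise template (the naming and error handling are my own; it is shown as a free function, but it could equally be a static member declared in CSVReader.h):

#include <array>
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Sketch: splits line into exactly N fields or throws.
template <std::size_t N>
std::array<std::string_view, N> tokenise(std::string_view line, char separator)
{
    std::array<std::string_view, N> fields;
    std::size_t cur = 0;
    for (std::size_t i = 0; i + 1 < N; ++i)
    {
        auto pos = line.find(separator, cur);
        if (pos == std::string_view::npos)
            throw std::runtime_error{"too few fields"};
        fields[i] = line.substr(cur, pos - cur);
        cur = pos + 1;
    }
    // The last field runs to the end of the line and must not contain another separator.
    if (line.find(separator, cur) != std::string_view::npos)
        throw std::runtime_error{"too many fields"};
    fields[N - 1] = line.substr(cur);
    return fields;
}

Usage would then look like stringsToOBE(tokenise<5>(line, ',')), with stringsToOBE taking the std::array<std::string_view, 5> by value or const reference.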

But why do you get a string from ifstream and then parse it and convert each element separately?

Why not extract elements from the stream directly, without creating an intermediate string? Say, add a function OrderBookEntry CSVReader::extractOBEElement(std::istream& stream) and just use the stream's native functions like >>?

It might work and be more efficient but streams are known to be slow, so generating a string first might indeed be preferable.
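
If you want to try that route anyway, here is a sketch of such an extractOBEElement. It reads the five fields straight from the stream with std::getline and a ',' delimiter rather than operator>> (which splits on whitespace, not commas); the names timestamp and product for the first two fields are only a guess based on the constructor call in the question's stringsToOBE.

// Sketch: reads one record directly from the stream, skipping the per-line string + tokenise step.
// Needs <istream>, <stdexcept>, <string>.
OrderBookEntry CSVReader::extractOBEElement(std::istream& stream)
{
    std::string timestamp, product, type, price, amount;
    std::getline(stream, timestamp, ',');
    std::getline(stream, product, ',');
    std::getline(stream, type, ',');
    std::getline(stream, price, ',');
    std::getline(stream, amount);  // last field runs to the end of the line
    if (!stream)
        throw std::runtime_error{"incomplete record"};

    return OrderBookEntry{std::stod(price), std::stod(amount),
                          timestamp, product,
                          OrderBookEntry::stringToOrderBookType(type)};
}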

ALX23z