0

I am trying to read entries of my csv file into an entries vector whose data type is object OrderBookEntry, this is an object that stores the tokenized variables from the csv line which itself is converted through the function tokenise. My problem is that my csv file has around a million lines and it takes over 30 seconds to load, how can I reduce this?

This is my code that actually reads lines from the file and converts them into entries.


    std::vector<OrderBookEntry> CSVReader::readCSV(std::string csvFilename)
    {
        std::vector<OrderBookEntry> entries;
    
        std::ifstream csvFile{csvFilename};
        
        std::string line;
        if (csvFile.is_open())
        {
            while(std::getline(csvFile, line))
            {
                try {
                    OrderBookEntry obe = stringsToOBE(tokenise(line, ','));
                    entries.push_back(obe);
                    //std::cout << "pushing entry" << std::endl;
                }catch(const std::exception& e)
                {
                    //std::cout << "CSVReader::readCSV bad data"  << std::endl;
                }
            }// end of while
        }    
    
        std::cout << "CSVReader::readCSV read " << entries.size() << " entries"  << 
        std::endl;
        return entries; 
    } 

This is the part of my code that tokenises the read line


    std::vector<std::string> CSVReader::tokenise(std::string csvLine, char separator)
    {
       std::vector<std::string> tokens;
       signed int start, end;
       std::string token;
        start = csvLine.find_first_not_of(separator, 0);
        do{
            end = csvLine.find_first_of(separator, start);
            if (start == csvLine.length() || start == end) break;
            if (end >= 0) token = csvLine.substr(start, end - start);
            else token = csvLine.substr(start, csvLine.length() - start);
            tokens.push_back(token);
        start = end + 1;
        }while(end > 0);
    
       return tokens; 
    } 

This is the part of my code that converts the tokens into a suitable object to work with


    OrderBookEntry CSVReader::stringsToOBE(std::vector<std::string> tokens)
    {
        double price, amount;
    
        if (tokens.size() != 5) // bad
        {
            //std::cout << "Bad line " << std::endl;
            throw std::exception{};
        }
        // we have 5 tokens
        try {
             price = std::stod(tokens[3]);
             amount = std::stod(tokens[4]);
        }catch(const std::exception& e){
            std::cout << "Bad float! " << tokens[3]<< std::endl;
            std::cout << "Bad float! " << tokens[4]<< std::endl; 
            throw;        
        }
    
        OrderBookEntry obe{price, 
                            amount, 
                            tokens[0],
                            tokens[1], 
                            OrderBookEntry::stringToOrderBookType(tokens[2])};
    
        return obe; 
    } 
A M
  • 14,694
  • 5
  • 19
  • 44
  • 1
    Your development tools should come with a profiler. Use the profiler to see what part of the program eats up the greatest proportion of time. That's where you'll want to spend your effort. That said, bit of unnecessary copying due to pass by value at `OrderBookEntry CSVReader::stringsToOBE(std::vector tokens)` unless the compiler's really smart. I'd use a pass by `const` reference here. – user4581301 Dec 27 '22 at 05:41
  • Also worth making sure you've got the optimizer cranked up. Not much point to measuring the performance of code unless you have the optimizer on. – user4581301 Dec 27 '22 at 05:42
  • Worth seeing what happens if you can use a view for `token` instead of making a new string with `substr`. Also give [this a read](https://codereview.stackexchange.com/help/asking) and see if Code Review is right for you. – user4581301 Dec 27 '22 at 05:46
  • 2
    Use a ready-made CSV reader library. Also don't ise floating point to represent money. – n. m. could be an AI Dec 27 '22 at 05:49
  • So, let's talk optimization. Most of your time is spent in I/O. Input devices, like to keep streaming, that is when they are most efficient. Create a huge buffer, and block read into that buffer. Let's combine threads here. One thread will input into one or more buffers. Another thread will process the data in the buffers. You'll want to have enough buffers so that the processing thread doesn't wait for the input data. Maybe even delay the start of the input thread until more than one buffers are ready. – Thomas Matthews Dec 27 '22 at 05:52
  • I/O and conversion. Unsync with the C library functions. Factor out the separate tokenization and memory allocation step and combine it with the record-string→OrderBookEntry parsing. Use the `std::from_chars`. Those should be the biggest time sinks — but first _**profile**!_ – Dúthomhas Dec 27 '22 at 05:55
  • 1
    As others have said, don't use floating point. Integers are still processed faster than floating point. In U.S. currency, that would mean using pennies instead of dollars. You could change units to 1/1000 of a dollar for better accuracy. You'll have a more efficient program, but will it be noticeable? For example, you can loose all your optimization benefits in the duration of printing. Do you really need this kind of efficiency or is development schedule more important? – Thomas Matthews Dec 27 '22 at 05:57
  • 1
    Oh, yeah, seconding the ”don’t use float for money”. Use fixed point with _four_ decimal places of precision. – Dúthomhas Dec 27 '22 at 05:57
  • Let's talk another optimization: passing by reference versus passing by copy. Passing small sized things (size that fits in a processor's register), can be passed by copy with little execution penalties. However, larger data structures, like vectors, have a higher execution cost when making copies to pass or return from functions. Instead, pass these large data structures by reference or constant reference if they aren't modified. Use parameters instead of returning *{a copy of}* the large structures. – Thomas Matthews Dec 27 '22 at 06:03
  • I’m voting to close this question because the question belongs to https://codereview.stackexchange.com – 273K Dec 27 '22 at 06:06
  • @Dúthomhas unsyncing with stdio is only relevant for standard streams. – n. m. could be an AI Dec 27 '22 at 07:38
  • @user4581301 sorry im a bit new to c++ , how would i pass a const reference of my vector into the function OrderBookEntry? – Beginner_coder Dec 27 '22 at 07:52
  • 1
    Make sure that you are not measuring the performance of your program with it built in debug mode. Debug mode can add significantly to the execution time of a program. – Wilf Rosenbaum Dec 27 '22 at 07:56
  • Just like what it sounds like: `OrderBookEntry CSVReader::stringsToOBE(std::vector tokens)` becomes `OrderBookEntry CSVReader::stringsToOBE(const std::vector &tokens)`. [A good book](https://stackoverflow.com/q/388242/4581301) will explain the whys and wherefores. – user4581301 Dec 27 '22 at 16:28

0 Answers0