My ultimate goal is to use a fast CSV parser in C++. I have looked at the following libraries:

I have also come across numerous stack-overflow questions regarding CSV Parsing such as:

My understanding is that the fastest way to parse CSV is to use C (obviously), memory mapping, and multi-threading.

I've tried many of the solutions above, with csv2 coming out the fastest (https://github.com/p-ranav/csv2).

But none of these are even close to data.table's fread. I have tried looking through their source code (https://github.com/Rdatatable/data.table) to try and extract the fread implementation in C. But I am struggling to incorporate it into my C++ code.

I believe the relevant files are:

  • dt_stdio.h, fread.c, fread.h, and myomp.h

I was wondering if there was an easy way to compile the existing data.table solution into my C++ codebase.

I think my best solution so far is csv2 (https://github.com/p-ranav/csv2). It memory-maps the file very quickly, but I am struggling to parse the mapped data quickly enough. Even if I just loop through the rows as in their documentation, my time goes up to 2 seconds.

csv2::Reader<csv2::delimiter<','>,
             csv2::quote_character<'"'>,
             csv2::first_row_is_header<true>,
             csv2::trim_policy::trim_whitespace> csv;

if (csv.mmap(file_name)) {
    const auto header = csv.header();
    for (const auto &row : csv) {
        // Looping over rows only --> 2 seconds
        for (const auto &cell : row) {
            // Looping over rows and cells (probably necessary for parsing) --> 17 seconds

            // Do something with cell value
            // std::string value;
            // cell.read_value(value);
        }
    }
}

EDIT:

I am using G++ 11.2.0 on Windows.

My G++ optimization flag was previously set to -O0. Changing it to -O3 improved performance (thanks @Alan Birtles).

Even after changing the compiler optimization settings, I get the following results:

| Method | Time to Read w/o Parsing | Time to Read + Parse |
|---|---|---|
| data.table | Not Applicable | 2 seconds |
| csv2_reader | .003 seconds | 17 seconds |
| csv2_reader with += 1 in loops | 6 seconds | 17 seconds |
| fastcppcsvparser | 2.5 seconds | 14 seconds |
| csv_parser | 17.5 seconds | not worth running |

Is there a way to get data.table's implementation into C++ without using Rcpp along with RInside?

Latest Question:

I just downloaded one of the benchmark datasets and get the same timing. Maybe I'm misunderstanding something, but adding `+= 1` to count the rows and columns in the loop slows it down from .001 seconds to 6 seconds, which seems weird. Using `cell.read_raw_value` then slows it down even further.

So how am I supposed to access this data in C++ once it's in a memory map without the huge performance loss, similar to whatever R's data.table does?

Chat: https://chat.stackoverflow.com/rooms/242552/c-csv-parsing

M--
road_to_quantdom
  • Have you enabled compiler optimisations? – Alan Birtles Mar 01 '22 at 20:36
  • Please add more information to your post (OS, compiler versions, library versions etc.) – kwsp Mar 01 '22 at 20:40
  • @AlanBirtles omg. i'm relatively new to C++. I didn't realize this was even an option. I believe the speed is significantly improved. I will have to read/learn about what this does, and then will come back to update my question or mark it as solved – road_to_quantdom Mar 01 '22 at 20:45
  • You compile your code without optimizations while still working on / debugging it, because optimized code can make interactive debugging a bit... weird. But once you have ascertained that the code works as expected, you turn optimizations on. This was true for C already, but C++ can do some *really strong* code massaging (as you might have seen). However, the executable code generated can deviate significantly from the source input. – DevSolar Mar 01 '22 at 20:50
  • changing the compiler optimization greatly improved its ability to memory map the csv file. but when i loop through and use `read_value` or `read_raw_value` for each element, the time slows down to 17 seconds. i will try other implementations and continue – road_to_quantdom Mar 01 '22 at 23:36
  • Note: for a file as big as you're trying to parse you probably want to parse in "Streaming mode" which is a forward only parser and will be much much faster than fully parsing the file. One of the libraries you're saying "isn't worth testing" supports that https://github.com/vincentlaucsb/csv-parser#features--examples Also you'll likely want to use "String view mode" which will allocate less. – Mgetz Mar 02 '22 at 17:26
  • @Mgetz i'm not sure what you mean. if it's using `std::ifstream` then this doesn't improve performance. it actually slows it down. also, the reason it's not "worth it" is because R's data.table can read the file AND parse it in 2 seconds. which is the type of performance I am looking for out of C++. Also the other packages `csv2` and `fastcppparser` are more performant than `csv-parser` – road_to_quantdom Mar 02 '22 at 17:46
  • @road_to_quantdom no streaming mode refers to only keeping a pointer to what you're parsing *right now* and nothing else. You can still do that memory mapped. But it basically doesn't try to hold the entire CSV in a parsed state in memory. Just the row you're currently using. This massively lowers memory usage and allocations and creates a much faster result when you're just reading in data to internal data structures. This is also a common tactic for dealing with massive XML files where having the entire parsed DOM is prohibitive. – Mgetz Mar 02 '22 at 17:53
  • @Mgetz i'm not sure how `csv-parser` implements what you're saying above – road_to_quantdom Mar 02 '22 at 20:25
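To make the "streaming mode" and "string view mode" ideas from the comments above concrete, here is a minimal sketch of a forward-only scan over a buffer that is already in memory (for example a memory map). It is not how `csv-parser`, `csv2`, or data.table implement it. The names `split_row`, `for_each_row`, and `sum_column` are hypothetical, and the sketch assumes no quoted fields, so every `,` is a delimiter. Only the current row is held in a parsed state, fields are `std::string_view`s into the original buffer, and numbers are converted in place with `std::from_chars` instead of being copied into a `std::string` first.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper: split one line into fields without copying.
// Assumes no quoted fields, so every ',' is a delimiter.
inline void split_row(std::string_view line, std::vector<std::string_view>& fields) {
    fields.clear();  // keeps its capacity, so later rows do not reallocate
    std::size_t start = 0;
    while (true) {
        const std::size_t comma = line.find(',', start);
        if (comma == std::string_view::npos) {
            fields.push_back(line.substr(start));
            return;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
}

// Hypothetical driver: walk the buffer front to back, one row at a time,
// and call on_row(fields) for each non-empty row.
template <typename OnRow>
void for_each_row(std::string_view data, OnRow&& on_row) {
    std::vector<std::string_view> fields;
    while (!data.empty()) {
        const std::size_t nl = data.find('\n');
        std::string_view line = data.substr(0, nl);
        data = (nl == std::string_view::npos) ? std::string_view{} : data.substr(nl + 1);
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);  // tolerate CRLF
        if (line.empty()) continue;  // skip blank lines
        split_row(line, fields);
        on_row(fields);
    }
}

// Example consumer: sum one integer column, converting each cell in place.
inline long long sum_column(std::string_view data, std::size_t column) {
    long long total = 0;
    for_each_row(data, [&](const std::vector<std::string_view>& fields) {
        if (column >= fields.size()) return;
        const std::string_view f = fields[column];
        long long value = 0;
        const auto result = std::from_chars(f.data(), f.data() + f.size(), value);
        // Cells that are not entirely an integer (e.g. a header) are skipped.
        if (result.ec == std::errc{} && result.ptr == f.data() + f.size()) total += value;
    });
    return total;
}
```

Calling `sum_column(buffer, 1)` on a buffer holding the whole file touches each byte once and allocates only the small reusable `fields` vector. Handling quotes, escaped delimiters, and multiple threads is left out on purpose.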

2 Answers

The answer to "can I call this C code from C++" is "yes, you can" (unless there is something truly weird going on). You do have to avoid name mangling, though.

The trick is this:

extern "C" {
  #include "somecode.h"
}

See: Call a C function from C++ code

But really, C++ should be able to produce a CSV parser that is the same speed as a C one; there is nothing that C can do that C++ cannot.
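As a slightly fuller illustration of the `extern "C"` trick above, C headers that are meant to be usable from C++ often carry the linkage guard themselves, so the includer does not have to wrap the `#include`. The sketch below is a generic example of that idiom, not taken from data.table. `count_delims` and the "counter.h" / "counter.c" split are hypothetical, and the definition sits in the same file only to keep the example self-contained.

```cpp
#include <stddef.h>

// --- what would normally live in a C header, e.g. a hypothetical "counter.h" ---
#ifdef __cplusplus
extern "C" {  // only C++ compilers see this: declare the names with C linkage
#endif

size_t count_delims(const char *buf, size_t len, char delim);

#ifdef __cplusplus
}
#endif
// --- end of header ---

// --- what would normally live in "counter.c" and be compiled as C ---
// The definition keeps the C linkage given by the declaration above,
// so its symbol name is not mangled.
size_t count_delims(const char *buf, size_t len, char delim) {
    size_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] == delim) ++n;
    }
    return n;
}
```

In a real project the `.c` file would be compiled with the C compiler and the `.cpp` file with the C++ compiler, then both object files linked together. Note that `extern "C"` only fixes name mangling: any functions the C code itself calls still have to be provided at link time.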

pm100
  • so it seems as if the C++ code in `csv2` creates a memory map fast enough to be comparable to `R's data.table`. but whenever I try to parse it, time slows down significantly. is there an efficient way to convert a memory map to some sort of C++ data structure? or specific way to access memory map in C++ efficiently? the purpose of including the external C code is to be able to do things with the data in C++. but the same speed as R data.table's fread is necessary – road_to_quantdom Mar 01 '22 at 23:55
  • @road_to_quantdom how big is your csv file? – pm100 Mar 02 '22 at 00:04
  • @road_to_quantdom have you tried csv2 benchmark files, do you get the same results? – pm100 Mar 02 '22 at 00:19
  • 17369353 rows and 28 columns. ~3-4GB. will run benchmarks on csv2 now and update question – road_to_quantdom Mar 02 '22 at 02:56
  • I just downloaded one of the benchmark data-sets and get the same timing. Maybe I'm misunderstanding something, but adding `+=1` to count the rows and columns in the loop slows it down from .001 seconds to 6 seconds, which seems weird. and then using `cell.read_raw_value` slows it down even further. so how am i supposed to access this data in C++ once it's in a memory map? – road_to_quantdom Mar 02 '22 at 03:20
  • I think this answer is too generic. It would be useful if it would present how to import to C++ the actual data.table fread C code in a run-able manner. – jangorecki Mar 02 '22 at 10:15
  • @road_to_quantdom the reason adding 1 makes a difference is that with a release build everything will pretty much be dropped, since those loops in the sample don't actually do anything. – pm100 Mar 02 '22 at 21:15
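The last comment above explains why the ".001 seconds" figure is misleading: an optimizing build is free to delete a loop whose result is never used, so that measurement mostly times nothing. A small sketch of a measurement that cannot be dropped is shown below. `count_cells` and `time_count` are hypothetical names, and counting `,` and `\n` is only a stand-in for real per-cell work (it assumes no quoted fields and that every row ends with a newline).

```cpp
#include <chrono>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for "visit every cell": count field terminators.
inline std::size_t count_cells(std::string_view data) {
    std::size_t cells = 0;
    for (const char c : data) {
        if (c == ',' || c == '\n') ++cells;
    }
    return cells;
}

struct Timed {
    std::size_t cells;  // result of the work, returned so it stays observable
    double seconds;     // wall-clock time spent in count_cells
};

// Time the scan and hand the result back to the caller. Because the count
// is returned (and should then be printed or otherwise used), the compiler
// cannot treat the loop as dead code.
inline Timed time_count(std::string_view data) {
    const auto t0 = std::chrono::steady_clock::now();
    const std::size_t cells = count_cells(data);
    const auto t1 = std::chrono::steady_clock::now();
    return {cells, std::chrono::duration<double>(t1 - t0).count()};
}
```

Printing both `cells` and `seconds` from the caller gives a baseline for how long a single pass over the mapped bytes takes. Any parser that also extracts values will necessarily take at least that long.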

I have tried the following in C++:

// main.cpp
extern "C" {
    #include <fread.h>
}

#include <iostream>
int main(int argc, char* argv[]) {

    return 0;
}

I then use the following to compile and link:

gcc -Iinclude -c -o fread.o fread.c
g++ -Iinclude -c -o main.o main.cpp
g++ -Iinclude -o main.exe main.o fread.o

But I then get the following errors when linking:

x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x1d0): undefined reference to `libintl_dgettext'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x1da): undefined reference to `Rprintf'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x42e): undefined reference to `libintl_dgettext'
x86_64-w64-mingw32/bin/ld.exe: fread.o:fread.c:(.text+0x447): undefined reference to `__halt'
... a lot more things

I have put the relevant files in the include folder of my project directory. I have also opened this issue about a potential C++ implementation: https://github.com/Rdatatable/data.table/issues/5343

road_to_quantdom