My ultimate goal is to use a fast CSV parser in C++. I have looked at the following libraries:
- https://github.com/ben-strasser/fast-cpp-csv-parser
- https://github.com/vincentlaucsb/csv-parser#integration
- https://github.com/p-ranav/csv2
I have also come across numerous Stack Overflow questions about CSV parsing, such as:
- Fastest way to get data from a CSV in C++
- Parse very large CSV files with C++
- Reason behind speed of fread in data.table package in R
My understanding is that the fastest way to parse CSV is to use C (obviously), memory mapping, and multi-threading.
I've tried many of the solutions above, with csv2 coming out the fastest (https://github.com/p-ranav/csv2). But none of them are even close to data.table's fread. I have tried looking through the data.table source code (https://github.com/Rdatatable/data.table) to extract the fread implementation in C, but I am struggling to incorporate it into my C++ code.
I believe the relevant files are: dt_stdio.h, fread.c, fread.h, and myomp.h.
I was wondering if there was an easy way to compile the existing data.table solution into my C++ codebase.
I think my best solution so far is using csv2 (https://github.com/p-ranav/csv2). It memory-maps the file very quickly, but I am struggling to parse it quickly enough. Even if I just loop through the rows as in the documentation, it takes 2 seconds:
#include <csv2/reader.hpp>

// file_name holds the path to the CSV file.
csv2::Reader<csv2::delimiter<','>,
             csv2::quote_character<'"'>,
             csv2::first_row_is_header<true>,
             csv2::trim_policy::trim_whitespace> csv;
if (csv.mmap(file_name)) {
    const auto header = csv.header();
    for (const auto &row : csv) {
        // Looping over rows only: about 2 seconds.
        for (const auto &cell : row) {
            // Looping over rows and cells, which is probably
            // necessary for parsing: about 17 seconds.
            // Do something with the cell value:
            // std::string value;
            // cell.read_value(value);
        }
    }
}
EDIT:
I am using G++ 11.2.0 on Windows.
My G++ optimization flag was previously set to -O0. Changing it to -O3 improved performance (@Alan Birtles).
Even after changing the compiler optimization settings, I get the following results, with and without parsing:
| Method | Time to Read w/o Parsing | Time to Read + Parse |
|---|---|---|
| data.table | Not Applicable | 2 seconds |
| csv2_reader | 0.003 seconds | 17 seconds |
| csv2_reader with += 1 in loops | 6 seconds | 17 seconds |
| fastcppcsvparser | 2.5 seconds | 14 seconds |
| csv_parser | 17.5 seconds | not worth running |
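A note on reading the "without parsing" column: at -O3 the optimizer is allowed to remove a loop whose result is never used, which might explain part of the gap between the empty loop and the loop with += 1. Below is a minimal sketch of a timing harness that keeps the result live, assuming std::chrono::steady_clock is an acceptable clock. The names time_seconds and count_newlines are hypothetical helpers for illustration and are not part of csv2 or any other library listed above.

```cpp
#include <chrono>
#include <cstddef>
#include <string_view>

// Hypothetical helper: runs `work` once and returns the elapsed wall time in seconds.
template <typename F>
double time_seconds(F &&work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical stand-in for "loop over the rows": counts newline characters.
// The count is returned so the caller can use it, which gives the loop an
// observable result that the compiler has to compute.
std::size_t count_newlines(std::string_view data) {
    std::size_t n = 0;
    for (char c : data) {
        if (c == '\n') {
            ++n;
        }
    }
    return n;
}
```

A call such as `time_seconds([&] { rows = count_newlines(buffer); })`, followed by printing `rows`, measures the loop with its result in use.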
Is there a way to get data.table's implementation into C++ without using Rcpp along with RInside?
Latest Question:
I just downloaded one of the benchmark datasets and get the same timing. Maybe I'm misunderstanding something, but adding += 1 to count the rows and columns in the loop slows it down from 0.001 seconds to 6 seconds, which seems weird. Using cell.read_raw_value slows it down even further.
So how am I supposed to access this data in C++ once it's in a memory map, without the huge performance loss, similar to whatever R's data.table does?
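To make the question concrete, here is a minimal sketch of the kind of in-place access I have in mind, assuming the mapped file is available as one contiguous buffer. It walks the buffer with std::string_view and converts fields with std::from_chars, so no per-cell std::string is allocated. The names ScanResult and scan_csv are hypothetical, the sketch assumes '\n' line endings and no quoted fields, and it is not how data.table or csv2 work internally.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Totals gathered in a single pass over the buffer.
struct ScanResult {
    std::size_t rows = 0;    // non-empty lines, including any header line
    std::size_t fields = 0;  // comma-separated fields across all rows
    long long sum = 0;       // sum of the fields that parse fully as integers
};

// Hypothetical helper: scans a comma-separated buffer in place.
// Assumes '\n' line endings and no quoted or escaped fields.
ScanResult scan_csv(std::string_view data) {
    ScanResult result;
    std::size_t pos = 0;
    while (pos < data.size()) {
        std::size_t eol = data.find('\n', pos);
        if (eol == std::string_view::npos) {
            eol = data.size();
        }
        const std::string_view line = data.substr(pos, eol - pos);
        pos = eol + 1;
        if (line.empty()) {
            continue;  // skip blank lines
        }
        ++result.rows;

        std::size_t start = 0;
        while (true) {
            const std::size_t comma = line.find(',', start);
            const std::size_t end =
                (comma == std::string_view::npos) ? line.size() : comma;
            // A view into the original buffer; nothing is copied.
            const std::string_view field = line.substr(start, end - start);
            ++result.fields;

            long long value = 0;
            const char *last = field.data() + field.size();
            const auto [ptr, ec] = std::from_chars(field.data(), last, value);
            if (ec == std::errc{} && ptr == last) {
                result.sum += value;
            }
            if (comma == std::string_view::npos) {
                break;
            }
            start = comma + 1;
        }
    }
    return result;
}
```

With a real file, the string_view would be built from the mapped pointer and length. Non-numeric fields such as header names are counted but left out of the sum.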
Chat: https://chat.stackoverflow.com/rooms/242552/c-csv-parsing