19

My goal is to parse large csv files with C++ in a Qt project on OS X. (When I say csv I also mean tsv and other variants; the files range from 1 GB to 5 GB.)

It seems like a simple task, but things get complicated as file sizes grow. I don't want to write my own parser because of the many edge cases involved in parsing csv files.

I have found various csv processing libraries to handle this job, but parsing a 1 GB file takes about 90~120 seconds on my machine, which is not acceptable. I am not doing anything with the data right now; I just process and discard it for testing purposes.

cccsvparser is one of the libraries I have tried. But the only library fast enough was fast-cpp-csv-parser, which gives acceptable results: 15 seconds on my machine. However, it works only when the file structure is known.

Example using fast-cpp-csv-parser:

#include "csv.h"

int main(){
    io::CSVReader<3> in("ram.csv");
    in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
    std::string vendor; int size; double speed;
    while(in.read_row(vendor, size, speed)){
    // do stuff with the data
    }
}

As you can see, I cannot load arbitrary files: I must define variables that match my file structure in advance, and I'm not aware of any method that allows me to create those variables dynamically at runtime.

The other approach I have tried is to read the csv file line by line with the fast-cpp-csv-parser LineReader class, which is really fast (about 7 seconds to read the whole file), and then parse each line with the cccsvparser lib, which can process strings. This takes about 40 seconds in total; an improvement over the first attempts, but still unacceptable.
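
For reference, that line-by-line variant looks roughly like the sketch below. The splitting here is a naive stand-in for the cccsvparser call (it ignores quoting and escaping), and the file name is just the example from above:

#include "csv.h"   // fast-cpp-csv-parser

#include <string>
#include <vector>

int main() {
    io::LineReader in("ram.csv");          // example file name
    while (char* line = in.next_line()) {  // returns nullptr at end of file
        // Naive split on tabs; a real csv parser also has to handle
        // quoting, escaping and embedded delimiters.
        std::vector<std::string> fields;
        char* start = line;
        for (char* p = line; ; ++p) {
            if (*p == '\t' || *p == '\0') {
                fields.emplace_back(start, p);
                if (*p == '\0') break;
                start = p + 1;
            }
        }
        // do stuff with fields, then discard them
    }
}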

I have seen various Stack Overflow questions related to csv file parsing, but none of them takes large-file processing into account.

Also, I spent a lot of time googling for a solution to this problem, and I really miss the freedom that package managers like npm or pip offer when searching for out-of-the-box solutions.

I would appreciate any suggestions on how to handle this problem.

Edit:

When using @fbucek's approach, processing time dropped to 25 seconds, which is a great improvement.

Can we optimize this even more?

Alexander
  • Why not read the first line with conventional methods, analyse it for the number/names of the header columns, and use that as input for the code you have shown? (A minimal sketch of this idea appears after these comments.) – Bowdzone Dec 10 '14 at 13:29
  • Is multithreading acceptable or not? – fbucek Dec 10 '14 at 14:57
  • What kind of disk are these multi-GB sized files stored on ? SSD or Ordinary Magnetic Disk ? – Malkocoglu Dec 10 '14 at 15:06
  • @Bowdzone: I have thought about this, but I cannot understand how to approach it. Can you show an example of what you have in mind? – Alexander Dec 10 '14 at 15:49
  • @Malkocoglu in my case it's a magnetic disk, but it should be able to run on both. – Alexander Dec 10 '14 at 15:50
  • @Alexander: I think in this (disk, not SSD) case, the time is spent reading data from the disk. Profile your app and see if it really utilizes the CPU or is mostly waiting for I/O... – Malkocoglu Dec 10 '14 at 17:07
  • If you really have to handle such huge files, you would need comparable resources. Google found the answer for a similar question: "read/write from multiple physical disks; processing them concurrently should give you a boost in performance." – SChepurin Dec 10 '14 at 17:55
  • From fast-cpp-csv-parser: `Disk I/O and CSV-parsing are overlapped using threads for efficiency.` and `Can read multiple GB files in reasonable time.` How many cores do you have? Have you checked whether the library makes it parallel for you, already? [Have you linked against pthread](https://code.google.com/p/fast-cpp-csv-parser/issues/detail?id=1&can=1)? – László Papp Dec 14 '14 at 08:03
  • Yes, pthread is enabled, but as I stated in the question, while fast-cpp-csv-parser has good performance, I cannot understand how I can use it in my project. – Alexander Dec 14 '14 at 09:06
  • I have 4 cores, and from what I can see this is not running in parallel. – Alexander Dec 14 '14 at 19:28
  • I am sorry, I must have explained myself wrong. The "fast" library utilizes all of my cores and gives me the results I need, but I cannot wrap my brain around how to use it when opening an arbitrary csv file (without knowing the structure). I have used another lib to construct a working version based on @fbucek's answer, which has relatively poor performance. – Alexander Dec 15 '14 at 05:25
  • @Alexander Is there a reasonable upper limit on the number of columns in your input? If so you could probably use the "fast" library. – cartographer Dec 16 '14 at 19:49
  • I forgot to mention, it looks like fast-cpp-csv-parser can't handle embedded newlines in fields, even if they're quoted, so if you need to be able to handle them you should mention it (being able to ignore them simplifies parallelization considerably). – cartographer Dec 16 '14 at 20:32
  • It is a good idea! I guess I can manage without newline handling in favor of performance. – Alexander Dec 16 '14 at 20:49
  • `i'm not aware of any method that allows me to create those variables dynamically in runtime` -> Do you mean each line is very different when it comes to the type? If it is only the line types that change, but they are consistent with each other, why can't you read in one line and decide about the type dynamically that way? Obviously, that will not help if the whole thing is completely inconsistent, so the details you provided about the input type are insufficient. Please elaborate! And of course, you can always use string, which is "dynamic". – László Papp Dec 17 '14 at 05:15
  • What are you doing with the data? Have you considered loading the data into a database and accessing it that way? – Mark Setchell Dec 17 '14 at 18:30
  • How about mixing @fbucek's approach with mmap calls? If you're operating in enough memory space, a single mmap call is enough to load the entire file into memory; it's theoretically 2x faster. You can divide the mapped region into small chunks and distribute them amongst the processing threads. – Mithun Ganatra Dec 17 '14 at 20:30
  • @lpapp : Memory mapping has a potential for a huge speed advantage compared to traditional IO. It lets the operating system read the data from the source file as the pages in the memory mapped file are touched. This works by creating faulting pages, which the OS detects and then the OS loads the corresponding data from the file automatically. This works the same way as the paging mechanism and is usually optimized for high speed I/O by reading data on system page boundaries and sizes (usually 4K) - a size for which most file system caches are optimized to. – Mithun Ganatra Dec 18 '14 at 11:19
  • Sure, if you have a server farm to read a huge CSV file, you can do that. – László Papp Dec 18 '14 at 11:21
  • @lpapp yes, I may have a header in the files. Types may change depending on the file being loaded; most importantly, the column count changes according to the loaded file. – Alexander Dec 18 '14 at 18:32
  • @MarkSetchell This is exactly what I am trying to do, but I need to parse the data first. – Alexander Dec 18 '14 at 18:37
  • @Alexander mmap has an offset and a size value - you could theoretically load a good sized chunk - say 1 GB into memory, and read off it from multiple threads. (see ans#2 in http://stackoverflow.com/questions/45972/mmap-vs-reading-blocks) and repeat the process again. That would address the main IO bottleneck while being scalable. – Fox Dec 19 '14 at 02:55
  • Is it at all an option to write a post-processing tool for your CSV file that would find the maximum length in bytes of all fields, and generate an even bigger but fixed-width-field CSV file? You gain the ability to seek randomly within the file and not have to read it all. As well, it is trivial to compute offsets in order to split the work equally between multiple threads. – Iwillnotexist Idonotexist Dec 19 '14 at 14:57
  • @Fox: agreed, that is actually a good idea. – László Papp Dec 19 '14 at 21:29
  • @Alexander: `can we optimize this even more ?` -> Yes, I answered that question, grabbing the important bits out of the useful comment and answer parts. – László Papp Dec 20 '14 at 07:54
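
Regarding Bowdzone's suggestion above, a minimal sketch of discovering the column names and count at run time by reading only the header line with standard C++ could look like this (the delimiter and file name are assumptions):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Read only the first line of the file and split it on the delimiter,
// so the column layout can be decided at run time.
std::vector<std::string> readHeader(const std::string& path, char delimiter = '\t')
{
    std::vector<std::string> columns;
    std::ifstream file(path);
    std::string headerLine;
    if (std::getline(file, headerLine)) {
        std::istringstream ss(headerLine);
        std::string column;
        while (std::getline(ss, column, delimiter))
            columns.push_back(column);
    }
    return columns;
}

int main() {
    const std::vector<std::string> columns = readHeader("ram.csv"); // example file name
    std::cout << columns.size() << " columns\n";
    // The count/names could then be used to pick a parsing strategy
    // for the rest of the file.
}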

3 Answers

12

I am assuming you are using only one thread.

Multithreading can speed up your process.

Your best result so far is 40 seconds. Let's stick to that.

I have assumed that you first read and then process (about 7 seconds to read the whole file):

7 seconds for reading + 33 seconds for processing.

First of all, you can divide your file into chunks of, let's say, 50 MB. That means you can start processing after reading the first 50 MB of the file; you do not need to wait until the whole file is read. That's 0.35 seconds for reading (now it is 0.35 + 33 seconds for processing, about 34 seconds in total).

When you use multithreading, you can process multiple chunks at a time. That can theoretically speed up processing by up to the number of your cores. Let's say you have 4 cores: that's 33/4 = 8.25 seconds.

So I think with 4 cores you can speed up your processing to about 9 seconds in total.

Look at QThreadPool and QRunnable or QtConcurrent. I would prefer QThreadPool.

Divide the task into parts:

  1. First try to loop over the file and divide it into chunks, doing nothing else with them yet.
  2. Then create a "ChunkProcessor" class which can process a chunk.
  3. Make "ChunkProcessor" a subclass of QRunnable and execute your processing in the reimplemented run() function.
  4. Once you have chunks and a class that can process them and is QThreadPool compatible, you can pass it into the pool.

It could look like this:

loop over the file {
  whenever a chunk is ready {
     ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
     // connect before start() so the finished() signal cannot be missed
     connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
     QThreadPool::globalInstance()->start(chunkprocessor);
  }
}

You can use std::shared_ptr to pass the processed data, so you do not need QMutex or anything else, and you avoid serialization problems with multiple threads accessing some resource.

Note: in order to use a custom signal argument type, you have to register it before use:

qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");
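
For illustration, a bare-bones ChunkProcessor along these lines might look as follows; this is only a sketch, ProcessedData and the newline/tab splitting are placeholders for whatever parsing you actually do, and the class has to be seen by moc because of the Q_OBJECT macro:

#include <QByteArray>
#include <QList>
#include <QObject>
#include <QRunnable>
#include <memory>

// Placeholder for whatever the parsed result of one chunk looks like.
struct ProcessedData {
    QList<QList<QByteArray>> rows;
};

class ChunkProcessor : public QObject, public QRunnable
{
    Q_OBJECT
public:
    // Takes ownership of the chunk allocated by the reading loop.
    explicit ChunkProcessor(QByteArray* chunk) : m_chunk(chunk) {}

    void run() override {
        auto result = std::make_shared<ProcessedData>();
        // Placeholder parsing: split the chunk into lines and the lines into
        // fields; a real implementation would use a proper csv parser here.
        for (const QByteArray& line : m_chunk->split('\n')) {
            if (!line.isEmpty())
                result->rows.append(line.split('\t'));
        }
        delete m_chunk;
        emit finished(result);  // delivered to the main thread via a queued connection
    }

signals:
    void finished(std::shared_ptr<ProcessedData> data);

private:
    QByteArray* m_chunk;
};

QThreadPool auto-deletes the runnable after run() returns, which is fine here because the result has already been handed over through the signal.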

Edit (based on the discussion; my answer was not clear about this): It does not matter what disk you use or how fast it is. Reading is a single-threaded operation. This solution was suggested only because it took 7 seconds to read, and again it does not matter what disk it is; 7 seconds is what counts. The only purpose is to start processing as soon as possible and not wait until reading is finished.

You can use:

QByteArray data = file.readAll();

Or you can use the principal idea (I do not know why it takes 7 seconds to read, or what is behind it):

 QFile file("in.txt");
 if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
   return;

 QByteArray* data = new QByteArray;    
 int count = 0;
 while (!file.atEnd()) {
   ++count;
   data->append(file.readLine());
   if ( count > 10000 ) {
     ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
     QThreadPool::globalInstance()->start(chunkprocessor);
     connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
     data = new QByteArray; 
     count = 0;
   }
 }

One file, one thread, read almost as fast as reading line by line "without" interruption. What you do with the data is another problem and has nothing to do with I/O; it is already in memory. So the only concern would be a 5 GB file and the amount of RAM on the machine.

It is a very simple solution: all you need is to subclass QRunnable, reimplement the run function, emit a signal when it is finished, pass the processed data using a shared pointer, and in the main thread join that data into one structure or whatever. A simple thread-safe solution.
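
The receiving slot on the main-thread side could then be as small as the sketch below; MyCsvLoader and m_allRows are made-up names for whatever object owns the reading loop and the merged result:

// Runs in the main thread (queued connection), so no mutex is needed
// when merging the chunks into one structure.
void MyCsvLoader::readingFinished(std::shared_ptr<ProcessedData> data)
{
    m_allRows.append(data->rows);  // m_allRows is a QList<QList<QByteArray>>
    // Optionally count outstanding chunks here to know when the whole
    // file has been processed.
}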

fbucek
  • 1
    Multithreading wouldn't help with disk I/O. Read this thread for explanation - http://stackoverflow.com/questions/902425/does-multithreading-make-sense-for-io-bound-operations. – SChepurin Dec 10 '14 at 17:53
  • 2
  • @SChepurin - 1) Multithreading in this example has nothing to do with disk I/O, not to mention multiple I/O operations. Here one file is read in one thread, as usual. 2) And it is not true anyway. It depends on the disk type, the task you are performing, etc. When you have everything optimised, writing multiple files won't help. But sometimes the bottleneck is not the disk. – fbucek Dec 10 '14 at 18:26
  • 1
  • Reading a file is disk I/O. He needs just that. Optimize it or not, you will not get the results promised. – SChepurin Dec 10 '14 at 19:43
  • 2
  • @SChepurin This is not a case of pure disk I/O; it's disk I/O followed by processing. Judging by the reported throughput (1 GB in 100 s = 10 MB/s), the disk I/O is almost certainly not the bottleneck (even old magnetic disks are an order of magnitude faster than that), so dividing the work (even into two threads, one to read and one to process) is very likely to improve throughput significantly. – nobody Dec 10 '14 at 22:00
  • @Andrew Medico - "...very likely to improve throughput significantly." Saying that, you could propose the solution. The only thing I can agree with is that 100 s is too long for these files. – SChepurin Dec 11 '14 at 03:08
  • Multithreading will only ever increase the performance if there are multiple cores. – László Papp Dec 13 '14 at 04:33
  • I have tested your solution: 25 seconds to process the file. Great improvement! – Alexander Dec 13 '14 at 09:17
  • @Alexander: it means that you are using a crappy library. A good CSV library _should_ do the parallel computation and not each end user! – László Papp Dec 13 '14 at 12:25
4

I would propose a multi-threaded approach with a slight variation: one thread is dedicated to reading the file in chunks of a predefined (configurable) size and keeps feeding data to a set of processing threads (more than one, based on the number of CPU cores). Let us say that the configuration looks like this:

chunk size = 50 MB
Disk Thread = 1
Process Threads = 5

  1. Create a class for reading data from the file. It holds a data structure used to communicate with the process threads; for example, this structure contains the starting and ending offsets of the read buffer for each process thread. For reading file data, the reader class holds 2 buffers, each of chunk size (50 MB in this case).
  2. Create a process class which holds (shared) pointers to the read buffers and to the offsets data structure.
  3. Now create a driver (probably the main thread) which creates all the threads, waits for their completion, and handles the signals.
  4. The reader thread is invoked with the reader class, reads 50 MB of data and, based on the number of threads, creates the offsets data structure object. In this case t1 handles 0-10 MB, t2 handles 10-20 MB and so on. Once ready, it notifies the processor threads, then immediately reads the next chunk from disk and waits for the completion notifications from the process threads.
  5. The processor threads, on notification, read data from the buffer and process it. Once done, each notifies the reader thread about completion and waits for the next chunk.
  6. This repeats until all the data is read and processed. Then the reader thread notifies the main thread about completion, which sends PROCESS_COMPLETION, upon which all threads exit; or the main thread chooses to process the next file in the queue.

Note that offsets are used for ease of explanation; the mapping from offsets to line delimiters needs to be handled programmatically, as sketched below.
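
As a rough illustration of that last point, the per-thread offsets can be snapped to the character just after the next line break, so no thread starts or ends in the middle of a record. The buffer layout and thread count are assumptions, and embedded newlines in quoted fields are deliberately ignored, as discussed in the question's comments:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split [buffer, buffer + size) into roughly equal ranges whose boundaries
// always fall just after a '\n', so every range contains complete lines.
std::vector<std::pair<std::size_t, std::size_t>>
splitOnLineBoundaries(const char* buffer, std::size_t size, std::size_t threads)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t begin = 0;
    for (std::size_t i = 1; i <= threads && begin < size; ++i) {
        std::size_t end = (i == threads) ? size
                                         : std::max(begin + 1, (size / threads) * i);
        // Move the cut forward until it sits right after a newline.
        while (end < size && buffer[end - 1] != '\n')
            ++end;
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}

Each (begin, end) pair can then be handed to one processing thread together with a pointer to the shared buffer.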

bsr
  • True. We can discuss the merits and demerits of the approach before making any comment on it. I have implemented this kind of algorithm and achieved improvements in performance. If code is written in a modular way, playing with multiple threads while benchmarking is not a problem at all. To explain further: there are 2 resources which block on each other: 1. the disk, 2. the processor. This algorithm separates them, and the reader thread does a sequential read on the file; sequential R/W is way faster than random R/W on traditional disks. – bsr Dec 17 '14 at 18:04
  • :). The library has the restriction that it works only when the file structure is known, as already mentioned in the question. – bsr Dec 17 '14 at 18:09
  • I cannot add a comment to your answer due to my reputation. I have seen the code of the library and they use async I/O to read the file, which is cleaner than what I have suggested. But it does not have multiple threads to process the csv lines. – bsr Dec 17 '14 at 18:37
  • `Disk I/O and CSV-parsing are overlapped using threads for efficiency.` – László Papp Dec 17 '14 at 18:39
0

If the parser you have used is not distributed, then obviously the approach is not scalable.

I would vote for a technique like the one below:

  • chunk up the file into a size that can be handled by a machine / time constraint
  • distribute the chunks to a cluster of machines (1..*) that can meet your time/space requirements
  • consider dealing in block sizes for a given chunk
  • avoid threads on the same resource (i.e. a given block) to save yourself from all thread-related issues
  • use threads to achieve non-competing (on a resource) operations, such as reading on one thread and writing on a different thread to a different file
  • do your parsing (now for this small chunk you can invoke your parser)
  • do your operations
  • merge the results back / distribute them as they are, if you can

Now, having said that, why can't you use Hadoop-like frameworks?

Senthu Sivasambu
  • 1
    Just out of curiosity: how is it different from the rest of the answers? – László Papp Dec 18 '14 at 15:06
  • The difference is that this answer is intended to be pure 'technique', with no suggestion of how it could be implemented (programming code / design patterns and so on). Given the original question and problem description, I wanted to emphasize that it must be a "distributed solution" to scale; that is the very reason Hadoop is here and thriving. With Ctrl+F I could not find this word before my answer. Now why is that difficult to grasp, I wonder? – Senthu Sivasambu Dec 18 '14 at 15:19
  • Parallel computing <> distributed computing. Given my theoretical understanding, distributed computing is a parallel computing technique, but perhaps not vice versa - just to make a point. – Senthu Sivasambu Dec 18 '14 at 15:30
  • Look, I even found you a link :) http://cs.stackexchange.com/questions/1580/distributed-vs-parallel-computing – Senthu Sivasambu Dec 18 '14 at 15:33