3

I have a large text file. I want to read this file and perform some manipulation on it.

This manipulation occurs independently on each line, so basically I am looking for some function that can do this in parallel.

void readFile(string filename) {
    // do manipulation
}

The "do manipulation" part is what can happen in parallel.

Agreed, this could be done easily using Hadoop, but that would be overkill. (It's a large file, but not so large that I need Hadoop for it...)

How do I do this in C++?

frazman
  • Read in one thread, pass the lines to other threads. – ctn Jun 21 '13 at 20:21
  • Create a range of lines to read for each thread, evenly distributed if possible. The rest is just using the language's threading features to accomplish this, which you should research. – dchhetri Jun 21 '13 at 20:23
  • You need a work queue for the threads: http://vichargrave.com/multithreaded-work-queue-in-c/ – ctn Jun 21 '13 at 20:25
  • It probably doesn't make sense to put too much work into this, because when it all comes down to it you're going to find yourself limited by disk access speed anyway. If you try to read from multiple threads, you're going to end up losing performance due to thrashing. You could use a producer-consumer approach, but you might not see the speedups you want because of the overhead of threading. – Wug Jun 21 '13 at 20:25
  • ... though that depends on how heavy the processing is. – Fred Foo Jun 21 '13 at 20:26
  • Can you give us SOME idea of how much work you are doing for each line? E.g. if you have 100 lines of 70 characters each, how long does it take to process? Like others have said, unless you can get it off the disk faster than your processing, you are not going to gain anything from threads. My disk delivers around 50-60 MB/s in the ideal case. Also, you may want to try "memory mapping" the file. – Mats Petersson Jun 21 '13 at 20:29
  • I have around 5 GB of data, and the data looks like 0.1234,0.3443,.. roughly 20-25 such numbers per line. Now I want to pass this data into a matrix (using the Boost BLAS libraries). – frazman Jun 21 '13 at 20:32
  • Then you have about zero processing. 99.9% of your time goes to getting data from the disk, and you can't make the reading itself parallel. – Guilherme Bernal Jun 21 '13 at 20:36

3 Answers

7

I would use mmap for that. mmap gives you memory-like access to the file, so you can easily read it in parallel. Have a look at other Stack Overflow topics about mmap. Be careful when using mmap with a non-read-only pattern.
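Something like this, as a minimal POSIX sketch (processLine is a hypothetical stand-in for your per-line work): map the file read-only, split it roughly evenly at newline boundaries, and give each thread its own slice.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical per-line work; replace with your own manipulation.
void processLine(const char* begin, const char* end) {}

// Scan one slice of the mapping line by line.
void processChunk(const char* begin, const char* end) {
    while (begin < end) {
        const char* nl = static_cast<const char*>(
            std::memchr(begin, '\n', end - begin));
        if (!nl) nl = end;
        processLine(begin, nl);
        begin = nl + 1;
    }
}

int main() {
    int fd = open("input.file", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) return 1;

    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(mapped);
    const char* fileEnd = data + st.st_size;

    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    const char* chunkStart = data;
    for (unsigned i = 1; i <= n && chunkStart < fileEnd; ++i) {
        // Aim for an even split, then extend to the next newline so
        // no line straddles two threads.
        const char* chunkEnd = std::max(chunkStart, data + st.st_size * i / n);
        while (chunkEnd < fileEnd && *chunkEnd != '\n') ++chunkEnd;
        threads.emplace_back(processChunk, chunkStart, chunkEnd);
        chunkStart = chunkEnd + 1;
    }
    for (auto& t : threads) t.join();

    munmap(mapped, st.st_size);
    close(fd);
}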

spinus
  • Although you can read in parallel, reading sequentially is likely to be much faster, as it is more cache-friendly. Anyway, you can never be sure without a proper benchmark. – Guilherme Bernal Jun 21 '13 at 21:14
  • Of course it depends. The proper data flow needs to be benchmarked, as you said. mmap is a pretty neat option in some cases; it is just good to know about it. – spinus Jun 21 '13 at 21:37
4

If I were faced with this problem and had to solve it, I'd just use a single-threaded approach; it's not worth putting much effort into parallelism without speeding up the underlying medium.

But say you have this on a ramdisk, or a really fast RAID, or the processing is somehow massively lopsided. Whatever the scenario, line processing now takes the majority of the time.

I'd structure my solution something like this:

class ThreadPool; // encapsulates a set of threads
class WorkUnitPool; // encapsulates a set of threadsafe work unit queues
class ReadableFile; // an interface to a file that can be read from

ThreadPool pool;
WorkUnitPool workunits;
ReadableFile file;

pool.Attach(workunits); // bind threads to (initially empty) work unit pool

file.Open("input.file");
while (!file.IsAtEOF()) workunits.Add(ReadLineFrom(file));

pool.Wait(); // wait for all of the threads to finish processing work units

My "solution" is a generic, high level design intended to provoke thinking of what tools you have available that you can adapt to your needs. You will have to think carefully in order to use this, which is what I want.

As with any threaded operation, be very careful to design it properly, otherwise you will run into race conditions, data corruption, and all manner of pain. If you can find a thread pool/work unit library that does this for you, by all means use that.
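If you want a concrete starting point, here is a minimal sketch of the reader-plus-work-queue design using only the standard library. processLine is a hypothetical stand-in for your per-line work; a real implementation would want a bounded queue and error handling.

#include <algorithm>
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-line work; replace with your own manipulation.
void processLine(const std::string& line) {}

std::queue<std::string> workQueue;
std::mutex queueMutex;
std::condition_variable queueCv;
bool doneReading = false;

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return !workQueue.empty() || doneReading; });
        if (workQueue.empty()) return; // reader finished and queue drained
        std::string line = std::move(workQueue.front());
        workQueue.pop();
        lock.unlock();
        processLine(line); // heavy work happens outside the lock
    }
}

int main() {
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);

    // One thread reads; the pool consumes.
    std::ifstream file("input.file");
    std::string line;
    while (std::getline(file, line)) {
        std::lock_guard<std::mutex> lock(queueMutex);
        workQueue.push(std::move(line));
        queueCv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        doneReading = true;
    }
    queueCv.notify_all();
    for (auto& t : pool) t.join();
}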

Wug
3

I suggest you use something like fread to read many lines into a buffer and then operate on the buffer in parallel.

http://www.cplusplus.com/reference/cstdio/fread/

I once read an image one pixel (int) at a time, did a conversion on each pixel, and then wrote the value to a buffer. That took well over a minute for a large file. When I instead used fread to read the whole file into a buffer first and then did the conversion on the buffer in memory, the whole operation took less than one second. That's a huge improvement without using any parallelism.

Since your file is so large, you can read it in chunks, operate on each chunk in parallel, and then read in the next chunk. You could even read the next chunk (with one thread) while you're processing the previous chunk in parallel (with e.g. 7 threads), but you might find that's not even necessary. Personally, I would do the parallelism with OpenMP, roughly like the sketch below.
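A rough sketch of that approach (compile with -fopenmp; the chunk size and processLine are hypothetical placeholders): fread one large chunk, split it into complete lines, process the lines with an OpenMP parallel for, and carry any trailing partial line into the next chunk.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical per-line work; replace with your own manipulation.
void processLine(const std::string& line) {}

int main() {
    FILE* f = fopen("input.file", "rb");
    if (!f) return 1;

    const size_t chunkSize = 64 * 1024 * 1024; // 64 MiB chunks; tune to taste
    std::vector<char> buffer(chunkSize);
    std::string carry; // partial last line carried into the next chunk

    for (;;) {
        size_t got = fread(buffer.data(), 1, buffer.size(), f);
        if (got == 0) break;

        // Split the chunk into complete lines; the trailing partial line
        // (no newline yet) stays in `carry` for the next iteration.
        std::vector<std::string> lines;
        size_t start = 0;
        for (size_t i = 0; i < got; ++i) {
            if (buffer[i] == '\n') {
                carry.append(buffer.data() + start, i - start);
                lines.push_back(std::move(carry));
                carry.clear();
                start = i + 1;
            }
        }
        carry.append(buffer.data() + start, got - start);

        // Process all complete lines in this chunk in parallel.
        #pragma omp parallel for
        for (long long j = 0; j < (long long)lines.size(); ++j)
            processLine(lines[j]);
    }
    if (!carry.empty()) processLine(carry); // file may not end with '\n'
    fclose(f);
}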

Edit: I forgot to mention that I gave an answer that uses fread to read in a file and process the lines in parallel with OpenMP: openmp - while loop for text file reading and using a pipeline. It would probably be simple to modify that code to do what you want.
