
I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible, but my code in Python is too slow. Here is a minimal example that shows the problem.

First, make some fake data:

paste <(seq 20000000) <(seq 2 20000001)  > largefile.txt

Now here is a minimal piece of Python code to read it in and make a dictionary.

#!/usr/bin/env python
import sys
from collections import defaultdict
fin = open(sys.argv[1])

dict = defaultdict(list)

for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])

Timings:

time ./read.py largefile.txt
real    0m55.746s

However, it is possible to read the whole file much faster:

time cut -f1 largefile.txt > /dev/null    
real    0m1.702s

My CPU has 8 cores; is it possible to parallelize this program in Python to speed it up?

One possibility might be to read in a large chunk of the input, run 8 processes in parallel to build dictionaries from non-overlapping subchunks of the data in memory, and then read in another large chunk. Is this possible in Python using multiprocessing somehow?

Update: the fake data was not very good, as it had only one value per key. Better is:

perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt

(Related to Read in large file and make dictionary.)

Simd
  • the list is probably slowing you down. – sihrc Aug 07 '13 at 13:20
  • @sihrc Most of the time is spent in dict[parts[0]].append(parts[1]) according to the profiling but append does take around 10% of the time. Note that in this fake case lists only have length 1 but in general I expect them to be longer. – Simd Aug 07 '13 at 13:23
  • That's what I mean. That .append part is slowing you down. Have you thought of alternatives to appending large lists? What kind of data are you appending? – sihrc Aug 07 '13 at 13:25
  • You cannot parallelize I/O with only one disk. If you have much RAM you can read the whole file into memory with `readlines()` and process the lines directly in RAM. – Bakuriu Aug 07 '13 at 13:39
  • Yes, but reading the file isn't the slow part. It's the appending of lists in the dictionary, which can be parallelized. – sihrc Aug 07 '13 at 13:41
  • What if you replace split with a regular expression? – eri Aug 16 '13 at 10:14
  • Unless you slurp the whole file into memory (bad practice, don't do it), you're going to see much better performance from a basic serial approach. That's because with a parallel approach the OS wastes a lot of time going back to the filesystem to seek() different areas of the file to pass to the thread/process responsible for working on that area. It's also pointless to parallelize more than the number of cores you have. – Michael Martinez Jul 26 '22 at 14:50

6 Answers


It may be possible to parallelize this to speed it up, but doing multiple reads in parallel is unlikely to help.

Your OS is unlikely to usefully do multiple reads in parallel (the exception is with something like a striped RAID array, in which case you still need to know the stride to make optimal use of it).

What you can do is run the relatively expensive string/dictionary/list operations in parallel with the read.

So one thread reads and pushes (large) chunks onto a synchronized queue, while one or more consumer threads pull chunks from the queue, split them into lines, and populate the dictionary.

(If you go for multiple consumer threads, as Pappnese says, build one dictionary per thread and then join them).
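
A minimal sketch of that producer/consumer arrangement might look like the following (Python 3 module names; the chunk size, queue depth, and consumer count are illustrative guesses rather than tuned values):

import sys
import threading
import queue
from collections import defaultdict

CHUNK_SIZE = 1 << 20          # read roughly 1 MB worth of lines at a time (a guess)

def reader(path, q, n_consumers):
    with open(path) as fin:
        while True:
            lines = fin.readlines(CHUNK_SIZE)   # a "chunk" here is a batch of whole lines
            if not lines:
                break
            q.put(lines)
    for _ in range(n_consumers):
        q.put(None)                             # one stop sentinel per consumer

def consumer(q, results):
    d = defaultdict(list)
    while True:
        lines = q.get()
        if lines is None:
            break
        for line in lines:
            parts = line.split()
            d[parts[0]].append(parts[1])
    results.append(d)

q = queue.Queue(maxsize=8)                      # bounded so the reader can't run too far ahead
results = []
n_consumers = 1                                 # more consumers will contend for the GIL
workers = [threading.Thread(target=consumer, args=(q, results))
           for _ in range(n_consumers)]
for w in workers:
    w.start()
reader(sys.argv[1], q, n_consumers)
for w in workers:
    w.join()

final = defaultdict(list)                       # join the per-consumer dictionaries
for d in results:
    for key, values in d.items():
        final[key].extend(values)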


Hints:


Re. bounty:

C obviously doesn't have the GIL to contend with, so multiple consumers are likely to scale better. The read behaviour doesn't change though. The downside is that C lacks built-in support for hash maps (assuming you still want a Python-style dictionary) and synchronized queues, so you have to either find suitable components or write your own. The basic strategy of multiple consumers each building their own dictionary and then merging them at the end is still likely the best.

Using strtok_r instead of str.split may be faster, but remember you'll need to manage the memory for all your strings manually too. Oh, and you need logic to manage line fragments as well. Honestly, C gives you so many options that I think you'll just need to profile it and see.

Useless
  • I've given a couple of hints to get you started. Write some code, and ask if you have trouble with it. – Useless Aug 07 '13 at 15:24
  • Are multiple threads going to work? I am worried about the GIL. – Simd Aug 07 '13 at 15:29
  • The producer thread is I/O bound, so if the GIL is released while the thread blocks on the read system call, _one_ consumer thread can make progress computing hashes and updating its dictionary. Multiple consumer threads are very likely to contend for the GIL though. – Useless Aug 07 '13 at 15:42
  • This answer to a similar question might help you: https://stackoverflow.com/a/43079667/196732 I'm not sure about joining the results together at the other end, though...? – CpILL May 02 '19 at 06:08
  • the multiprocessing module avoids the GIL. But still doesn't help in this case for reasons already given. – Michael Martinez Jul 26 '22 at 14:46

It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.

Because the OP mentioned that the lists on each input line would be longer in practice than two elements, I made a slightly more realistic input file using:

paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) |
  head -1000000 > largefile.txt

After profiling the original code, I found the slowest portion of the process to be the line-splitting routine. (.split() took approximately 2x longer than .append() on my machine.)

1000000    0.333    0.000    0.333    0.000 {method 'split' of 'str' objects}
1000000    0.154    0.000    0.154    0.000 {method 'append' of 'list' objects}

So I factored out the split into another function and used a pool to distribute the work of splitting the fields:

import sys
import collections
import multiprocessing as mp

d = collections.defaultdict(list)

def split(l):
    # Runs in the worker processes: split one line into its fields.
    return l.split()

pool = mp.Pool(processes=4)
# pool.map pickles each line, ships it to a worker, and unpickles the returned fields.
for keys in pool.map(split, open(sys.argv[1])):
    d[keys[0]].append(keys[1:])

Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this:

$ time python process.py smallfile.txt 
real    0m7.170s
user    0m6.884s
sys     0m0.260s

versus the parallel version:

$ time python process-mp.py smallfile.txt 
real    0m16.655s
user    0m24.688s
sys     0m1.380s

Because the .map() call basically has to serialize (pickle) each input, send it to the remote process, and then deserialize (unpickle) the return value from the remote process, using a pool in this way is much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute this work.

To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
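
One way to approximate that idea without worrying about lines straddling block boundaries is to batch whole lines and hand each batch to the pool. A rough sketch (Python 3; the batch size and pool size are made-up numbers, and the cost of unpickling the results in the parent may still dominate):

import sys
import itertools
import multiprocessing as mp
from collections import defaultdict

BATCH = 100000                 # lines per task sent to a worker (illustrative)

def parse_batch(lines):
    # Runs in a worker: return a plain list of (key, rest-of-fields) pairs.
    return [(parts[0], parts[1:]) for parts in (line.split() for line in lines)]

def batches(f, n):
    # Yield successive lists of n whole lines from the open file.
    while True:
        chunk = list(itertools.islice(f, n))
        if not chunk:
            return
        yield chunk

if __name__ == '__main__':
    d = defaultdict(list)
    with open(sys.argv[1]) as fin, mp.Pool(processes=4) as pool:
        for pairs in pool.imap(parse_batch, batches(fin, BATCH)):
            for key, values in pairs:
                d[key].append(values)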

lmjohns3
  • I am surprised the dictionary creation wasn't the slowest part for you, as it is in my profiling. About the line length, I think I wasn't clear: each line will not be exactly the same length but will split into the same number of parts. The new fake data example I just added is more realistic. – Simd Aug 12 '13 at 06:45
  • pool.map(split, open(sys.argv[1]), 10000) groups the lines into chunks of 10000, but the execution time does not differ from the non-chunked version. – eri Aug 16 '13 at 10:05
  • "Some sort of fixed block size" sounds good. Could you add some details? – Cloud Cho Aug 09 '23 at 20:48

There was a blog post series, the "Wide Finder Project", about this several years ago at Tim Bray's site [1]. You can find there a solution [2] by Fredrik Lundh of ElementTree [3] and PIL [4] fame. I know posting links is generally discouraged on this site, but I think these links give you a better answer than copy-pasting his code would.

[1] http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/

Ecir Hana
  • 2 links out of 4 are dead. Now it would make sense to have the content copied here too. – DanielTuzes Jan 16 '21 at 00:42
  • @DanielTuzes agreed; for others, you can always check https://web.archive.org/. For these cases: https://web.archive.org/web/20201109032649/http://effbot.org/zone/wide-finder.htm and https://web.archive.org/web/20201121102218/http://www.pythonware.com/products/pil/ – CodeNoob May 06 '21 at 12:51

One thing you could try is to get the line count from the file, then spawn 8 threads that each make a dictionary from 1/8th of the file, and join the dictionaries when all the threads are finished. This will probably speed things up if it is the appending that takes the time rather than the reading of the lines.
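
A sketch of that split-and-merge idea, using processes rather than threads (since CPython threads would serialize the parsing, as the comment below points out); the worker count and the byte-offset arithmetic are illustrative, and the input is assumed to be plain two-column text:

import sys
import os
import multiprocessing as mp
from collections import defaultdict

def build_dict(args):
    # Parse only the lines whose first byte falls in [start, end).
    path, start, end = args
    d = defaultdict(list)
    with open(path, 'rb') as f:        # binary mode so byte offsets are exact
        if start:
            f.seek(start - 1)
            f.readline()               # advance to the first line starting at or after `start`
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            parts = line.split()       # keys and values are bytes here; decode if needed
            d[parts[0]].append(parts[1])
    return d

if __name__ == '__main__':
    path = sys.argv[1]
    size = os.path.getsize(path)
    n = 8
    bounds = [size * i // n for i in range(n + 1)]
    tasks = [(path, bounds[i], bounds[i + 1]) for i in range(n)]
    with mp.Pool(n) as pool:
        partial = pool.map(build_dict, tasks)
    merged = defaultdict(list)         # join the per-worker dictionaries
    for d in partial:
        for key, values in d.items():
            merged[key].extend(values)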

oyhovd
  • Since this is a CPU-bound problem, [threading isn't going to help much](http://www.linuxforu.com/2011/01/python-threading-and-its-caveats/) on CPython. The normal solution is to use multiprocessing instead, but that does not allow sharing a python dict amongst processes. Therefore there must be a single process with a dict that has the task of adding to it, and other processes for reading and parsing that send along their results using a queue or pipe. I don't think you'll be able to achieve a speedup of more than 3x using this method, because of the overhead. – Lauritz V. Thaulow Aug 07 '13 at 14:35
  • @LauritzV.Thaulow What about loading the whole thing into memory then building 8 dicts and then merging them? Does that sound sensible? – Simd Aug 07 '13 at 14:57
  • @Anush Well, the dicts have to be serialized and rebuilt when moved from one process to another, and since the rebuilding will require almost as much work as the building, it defeats much of the purpose. Then the rebuilt dicts must be merged, and here my python knowledge is lacking, because I don't know if `d1.update(d2)` can do some shortcut that makes it much faster than looping over `d2` and adding key-value pairs one by one. In any case I doubt it will help much. I understand IronPython does not have the CPU-bound threading problem, so if you can use that then Useless' answer should work. – Lauritz V. Thaulow Aug 07 '13 at 21:57

A more radical solution to the slow dictionary appending: replace the dictionary with an array of pairs of strings, fill it, and then sort it.
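
A rough sketch of that fill-then-sort approach, assuming the two-column input from the question (one sort replaces millions of per-key appends; the final grouping step is only needed if you still want per-key lists):

import sys
from itertools import groupby
from operator import itemgetter

pairs = []
with open(sys.argv[1]) as fin:
    for line in fin:
        key, value = line.split()      # assumes two whitespace-separated columns
        pairs.append((key, value))

pairs.sort(key=itemgetter(0))          # one O(n log n) sort instead of per-key appends

# Rebuild the per-key lists from adjacent equal keys in the sorted array.
grouped = {key: [v for _, v in group]
           for key, group in groupby(pairs, key=itemgetter(0))}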

Mikhail M

If the data in the file does not change very often, you can choose to serialize it. The Python interpreter will deserialize it much more quickly than re-parsing the text. You can use the cPickle module.
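
A hedged sketch of that caching idea: build the dictionary once, dump it next to the input file, and reload the pickle on later runs. (pickle is the Python 3 name; on Python 2 the fast version is cPickle. The .pickle cache path and the mtime check are illustrative choices.)

import sys
import os
import pickle                  # cPickle on Python 2
from collections import defaultdict

src = sys.argv[1]
cache = src + '.pickle'        # illustrative cache file next to the input

if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(src):
    with open(cache, 'rb') as f:
        d = pickle.load(f)     # usually much faster than re-parsing the text file
else:
    d = defaultdict(list)
    with open(src) as fin:
        for line in fin:
            parts = line.split()
            d[parts[0]].append(parts[1])
    with open(cache, 'wb') as f:
        pickle.dump(dict(d), f, protocol=pickle.HIGHEST_PROTOCOL)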

Creating 8 separate processes is another option, although they will still have to combine their results into a single dictionary. Those processes can communicate via a Pipe from the multiprocessing module, or via the socket module.

Best regards

Barış ÇUHADAR.