I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible, but my code in Python is too slow. Here is a minimal example that shows the problem.
First, make some fake data:
paste <(seq 20000000) <(seq 2 20000001) > largefile.txt
Now here is a minimal piece of Python code to read it in and make a dictionary:
#!/usr/bin/env python
import sys
from collections import defaultdict

d = defaultdict(list)              # map each key to the list of its values
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split()
        d[parts[0]].append(parts[1])
Timings:
time ./read.py largefile.txt
real 0m55.746s
However, it is possible to read the whole file much faster, for example:
time cut -f1 largefile.txt > /dev/null
real 0m1.702s
My CPU has 8 cores; is it possible to parallelize this program in Python to speed it up?
One possibility might be to read the input in large chunks, run 8 processes in parallel on different non-overlapping subchunks to build dictionaries in memory from their parts, and then read in another large chunk. Is this possible in Python using multiprocessing somehow?
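For what it's worth, here is a minimal sketch of that idea using multiprocessing.Pool; the worker count, chunk size and helper names are placeholders, and the cost of sending subchunks and partial dictionaries between processes may well eat part of the gain:

#!/usr/bin/env python
# Sketch only: chunked parallel dictionary building, assuming the same
# two-column whitespace-separated input format as above.
import sys
from collections import defaultdict
from multiprocessing import Pool

NUM_WORKERS = 8            # assumption: one worker per core
LINES_PER_CHUNK = 1000000  # assumption: tune to available memory

def build_partial(lines):
    # Build a partial dictionary from one subchunk of lines.
    d = defaultdict(list)
    for line in lines:
        parts = line.split()
        d[parts[0]].append(parts[1])
    return d

def read_chunks(fin, n):
    # Yield lists of at most n lines from the open file.
    chunk = []
    for line in fin:
        chunk.append(line)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def main():
    result = defaultdict(list)
    pool = Pool(NUM_WORKERS)
    with open(sys.argv[1]) as fin:
        for chunk in read_chunks(fin, LINES_PER_CHUNK):
            # Split the chunk into non-overlapping subchunks, one per worker.
            step = len(chunk) // NUM_WORKERS + 1
            subchunks = [chunk[i:i + step] for i in range(0, len(chunk), step)]
            # Build partial dictionaries in parallel, then merge them.
            for partial in pool.map(build_partial, subchunks):
                for key, values in partial.items():
                    result[key].extend(values)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

Whether this actually beats the serial loop depends on how much of the time goes into pickling data between processes and merging the partial dictionaries, so it is worth timing it against the original.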
Update: the fake data was not very good, as it had only one value per key. Better is:
perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
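If Perl is not handy, a rough Python equivalent of that one-liner (same filename, same counts) is:

#!/usr/bin/env python
# Rough equivalent of the Perl one-liner above: 10 million lines with
# keys drawn from 0..1e7-1 and values from 0..1e4-1, so keys repeat.
import random

with open('largefile.txt', 'w') as fout:
    for _ in range(10 * 1000 * 1000):
        fout.write('%d %d\n' % (random.randrange(10 ** 7),
                                random.randrange(10 ** 4)))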
(Related to Read in large file and make dictionary.)