I have a large file (>= 1GB) that I'm trying to read and then load the contents into a dictionary. With simple code that reads one line at a time, it takes about 8 minutes just to read the file and populate the dictionary. The snippet I'm using is below:
my_dict = {}  # maps each ID to a list of [start, end] pairs

with open(filename, 'r') as f:
    for line in f:
        toks = line.rstrip().split()
        id1 = toks[0]
        id2 = toks[1]
        start = int(toks[4])
        end = int(toks[5])
        if id1 not in my_dict:
            my_dict[id1] = [[start, end]]
        else:
            if [start, end] not in my_dict[id1]:
                my_dict[id1].append([start, end])
        if id2 not in my_dict:
            my_dict[id2] = [[start, end]]
        else:
            if [start, end] not in my_dict[id2]:
                my_dict[id2].append([start, end])
Now, just running this block of code alone takes a long time, and I'm wondering if I can speed up this process, maybe using multiprocessing? I've looked into some material close to what I want to do, like here and here and many others. But, given that I'm very new to this, I'm struggling to even decide whether multiprocessing is the right way to go. Also, most of the multiprocessing-related resources I've found don't explain how to update the dictionary; I think I read somewhere that sharing the same dictionary across processes is a bad idea. I wish I could be more specific with my question, and hopefully it won't get flagged as not suitable for SO, but I just want to speed up the process of building the dictionary.
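For what it's worth, this is the kind of multiprocessing layout I was imagining: split the file into chunks of lines, have each worker build its own local dictionary, and merge the per-worker dictionaries at the end so nothing is shared between processes. I'm not sure the cost of pickling the chunks and the results back and forth wouldn't eat the gains, and chunk_size, parse_chunk and merge below are just my own guesses at how this would look, not something taken from the linked material:

import collections
import multiprocessing as mp
from itertools import islice

def parse_chunk(lines):
    # Each worker builds its own private dict; nothing is shared across processes.
    local = collections.defaultdict(set)
    for line in lines:
        toks = line.split()
        start, end = int(toks[4]), int(toks[5])
        local[toks[0]].add((start, end))
        local[toks[1]].add((start, end))
    return local

def merge(dicts):
    # Combine the per-worker dicts back in the parent process.
    merged = collections.defaultdict(set)
    for d in dicts:
        for key, pairs in d.items():
            merged[key] |= pairs
    return merged

if __name__ == '__main__':
    chunk_size = 1000000  # lines per task -- a guess, not tuned
    with open(filename, 'r') as f, mp.Pool() as pool:  # same filename as above
        # Repeatedly pull the next chunk_size lines until the file is exhausted.
        chunks = iter(lambda: list(islice(f, chunk_size)), [])
        my_dict = merge(pool.imap_unordered(parse_chunk, chunks))

Is something along these lines sensible, or is the overhead of sending the data between processes likely to dominate?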
EDIT
After the suggestions made by @juanpa.arrivillaga, my code looks like:
import collections

my_dict = collections.defaultdict(set)

with open(filename, 'r') as f:
    for line in f:
        toks = line.rstrip().split()
        id1 = toks[0]
        id2 = toks[1]
        start = int(toks[4])
        end = int(toks[5])
        my_dict[id1].add((start, end))
        my_dict[id2].add((start, end))
This reduced the time to ~21 seconds when run on a ~500MB file with 11 million lines. The big win seems to be replacing the [start, end] not in my_dict[...] list scans (which get slower as each list grows) with constant-time set membership checks.
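One small follow-up in case it trips anyone else up: the values are now sets of (start, end) tuples instead of lists of [start, end] lists. If later code expects the old structure, a one-time conversion at the end is still far cheaper than the per-line not in checks I had before (my_dict_as_lists is just a name I made up for this sketch):

# One-time conversion after the whole file has been read.
my_dict_as_lists = {key: sorted(list(pair) for pair in pairs)
                    for key, pairs in my_dict.items()}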