I have a list containing 700,000 items and a dictionary containing 300,000 keys. Some of the 300k keys appear among the 700k items stored in the list. I have built a simple comparison-and-update loop:
import datetime

# the file holds about 700k lines: id,firstname,lastname,email,lastupdate
lines = open(r'myfile.csv', 'r').readlines()

# dictionary maps about 300k ID keys to records like this:
dictionary = {}
dictionary[someID] = {'first': 'john',
                      'last': 'smith',
                      'email': 'john.smith@gmail.com',
                      'lastupdate': datetime_object}

for line in lines:
    uid, firstname, lastname, email, lastupdate = line.strip().split(',')
    lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')
    if uid in dictionary:  # O(1) membership test, no need for .keys()
        # update dictionary[uid] only if this line is more recent
        if lastupdate > dictionary[uid]['lastupdate']:
            dictionary[uid] = {'first': firstname,
                               'last': lastname,
                               'email': email,
                               'lastupdate': lastupdate}
    else:
        # create a new entry for this id
        dictionary[uid] = {'first': firstname,
                           'last': lastname,
                           'email': email,
                           'lastupdate': lastupdate}
I would like to speed things up by using multiprocessing for this kind of job. My idea was to split the list into four smaller lists, Pool.map over them, and have each of the four processes build its own dictionary. The problem is that to end up with one whole dictionary holding the most recently updated values, I would then have to repeat the comparison process on the four newly created dictionaries, and so on.
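Concretely, something along these lines is what I have in mind (just a sketch: process_chunk, merge_partials, and the chunking logic are placeholder names of my own, the four-way split is hard-coded, and for simplicity it builds the result from the file alone, without seeding it with my existing 300k-key dictionary):

import datetime
from multiprocessing import Pool

DATE_FMT = '%Y-%m-%d %H:%M:%S'

def process_chunk(chunk):
    # build a partial dictionary from one slice of the input lines,
    # keeping only the newest record per id within the chunk
    partial = {}
    for line in chunk:
        uid, first, last, email, lastupdate = line.strip().split(',')
        lastupdate = datetime.datetime.strptime(lastupdate, DATE_FMT)
        existing = partial.get(uid)
        if existing is None or lastupdate > existing['lastupdate']:
            partial[uid] = {'first': first, 'last': last,
                            'email': email, 'lastupdate': lastupdate}
    return partial

def merge_partials(partials):
    # fold the per-process dictionaries into one,
    # again keeping the newest record per id
    merged = {}
    for partial in partials:
        for uid, record in partial.items():
            if uid not in merged or record['lastupdate'] > merged[uid]['lastupdate']:
                merged[uid] = record
    return merged

if __name__ == '__main__':
    lines = open(r'myfile.csv', 'r').readlines()
    n = 4
    size = (len(lines) + n - 1) // n
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with Pool(processes=n) as pool:
        partials = pool.map(process_chunk, chunks)
    dictionary = merge_partials(partials)

The final merge_partials call here is exactly the sequential pass over the four partial dictionaries that I am worried about having to repeat.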
Has anyone dealt with this kind of problem before, and does anyone have a solution or an idea for it?
Thanks