
I have a list containing 700,000 items and a dictionary containing 300,000 keys. Some of the 300k keys appear among the 700k items stored in the list. I have built a simple comparison and handling loop:

import datetime

# the file contains about 700k lines: id,firstname,lastname,email,lastupdate
lines = open(r'myfile.csv', 'rb').readlines()

# the dictionary maps 300k ID keys to records, e.g.:
# dictionary[some_id] = {'first': 'john',
#                        'last': 'smith',
#                        'email': 'john.smith@gmail.com',
#                        'lastupdate': datetime_object}
dictionary = {}

for line in lines:
    id, firstname, lastname, email, lastupdate = line.strip().split(',')
    lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')
    if id in dictionary.keys():
        # replace dictionary[id]'s values if this line is newer
        if lastupdate > dictionary[id]['lastupdate']:
            dictionary[id] = {'first': firstname, 'last': lastname,
                              'email': email, 'lastupdate': lastupdate}
    else:
        # create a new entry for this id
        dictionary[id] = {'first': firstname, 'last': lastname,
                          'email': email, 'lastupdate': lastupdate}

I wish to speed things up a little and use multiprocessing for this kind of job. My idea was to split the list into four smaller lists, Pool.map each list, and let each of the four processes check its list separately, building four new dictionaries. The problem is that in order to create one whole dictionary with the last-updated values, I would have to repeat the merging process with the four newly created dictionaries, and so on.
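Here is a minimal sketch of that idea, assuming Python 2 and four workers (build_partial and merge are just placeholder names I made up, and the final merge is exactly the repeated step described above):

import datetime
import multiprocessing

def build_partial(chunk):
    # build a partial dictionary from one chunk of lines,
    # keeping only the newest record per id
    partial = {}
    for line in chunk:
        id, firstname, lastname, email, lastupdate = line.strip().split(',')
        lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')
        if id not in partial or lastupdate > partial[id]['lastupdate']:
            partial[id] = {'first': firstname, 'last': lastname,
                           'email': email, 'lastupdate': lastupdate}
    return partial

def merge(dictionary, partials):
    # fold each partial dictionary into the main one,
    # again keeping only the newest record per id
    for partial in partials:
        for id, record in partial.items():
            if id not in dictionary or record['lastupdate'] > dictionary[id]['lastupdate']:
                dictionary[id] = record
    return dictionary

if __name__ == '__main__':
    lines = open(r'myfile.csv', 'rb').readlines()
    chunks = [lines[i::4] for i in range(4)]   # four roughly equal slices
    pool = multiprocessing.Pool(4)
    dictionary = merge({}, pool.map(build_partial, chunks))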

Has anyone ever dealt with such a problem, and does anyone have a solution or an idea for it?

Thanks

devdc

2 Answers

if id in dictionary.keys():

NO! Please No! This is an O(n) operation!!! The right way to do it is simply

if id in dictionary

which takes O(1) time!!!

Before thinking about using multiprocessing etc. you should avoid these really inefficient operations. If the dictionary has 300k keys, that line was probably the bottleneck.
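If you want to convince yourself, a quick comparison along these lines (a sketch, using a throwaway 300k-key dict) should show the difference under Python 2:

import timeit

setup = "d = dict.fromkeys(range(300000))"

# builds a 300k-element list on every call (Python 2), then scans it
print(timeit.timeit("299999 in d.keys()", setup, number=100))

# uses the hash table directly: O(1) expected time per lookup
print(timeit.timeit("299999 in d", setup, number=100))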


I have assumed Python 2 here. In Python 3, using key in dictionary.keys() is O(1), because .keys() now returns a view of the dict instead of a list of the keys; however, it is still a bit faster to omit .keys().

Bakuriu
  • Hey dude. First, I wish to thank you for answering. Second, I would like to make sure I understand you correctly. Do you mean that the whole process, including the `for` statement and the `if-in`, will take O(n)? I recently thought of another wrong-but-might-be-better way, which is using a `try-except` statement as a key-existence check. What do you suggest? Thanks again. – devdc Nov 23 '13 at 18:29
    @Hedgie No, I mean that the *single* operation `key in some_dictionary.keys()` takes O(n) time (in python2), because `.keys()` has to build a list of all the keys of the dictionary and the `in` will have to scan the whole list. If you do `key in some_dictionary`, *without* calling `.keys()` the dictionary will use the hash of the object to check the condition in O(1) expected time. – Bakuriu Nov 24 '13 at 17:49
  • Good to know. Sounds like it solves my problem. I'll try and get back with feedback :) – devdc Nov 24 '13 at 23:16

I think you should start by not splitting the same line once for each token, over and over again; split it once and unpack:

id, firstname, lastname, email, lastupdate = line.split(',')
lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')
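For what it's worth, the csv module can also take care of the splitting; a minimal sketch, assuming Python 2 and that every row has exactly these five fields:

import csv
import datetime

with open(r'myfile.csv', 'rb') as f:
    for id, firstname, lastname, email, lastupdate in csv.reader(f):
        # one split per line, done by the csv reader
        lastupdate = datetime.datetime.strptime(lastupdate, '%Y-%m-%d %H:%M:%S')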
perreal