I have two big dictionaries. Each of them looks like this
{"md5":(value1,value2)}
. Each of them of contains up to 10 million elements (I used dict.fromkeys()
to generate them). But now I want to merge them into a single one to eliminate duplicates. What is the fastest way to do so? Can I parallelize it? It seems this is a CPU bounded problem because if I use dict.update()
, one core is always 100% used, while the other cores are idle.

- 2,122
- 13
- 15

- 2,623
- 7
- 36
- 38
-
1Thank you. I know dictionary is thread safe in terms of modifying values, but there are two challenges: the large size and the shared memory. None of your posts addresses these challenges. – fanchyna May 05 '15 at 18:34
3 Answers
This answer might be "cheating", but if you use a ChainMap
, this merging (just creating the ChainMap) is O(1). As fast as it gets. Instead of working with a dict, you work with a dict-like object.
The tradeoff is that each lookup in the resulting collection uses 1 or 2 lookups in the underlying dicts. If this overhead is acceptable in your case, ChainMap is the answer.
-
Thank you, but unfortunately, I am only allowed to use Python2.6. ChainMap is introduced in Python3. – fanchyna May 05 '15 at 18:32
-
@fanchyna see this recipe: http://code.activestate.com/recipes/305268-chained-map-lookups/ – shx2 May 05 '15 at 19:16
Thank you for all answers. Actually, I found the answer by writing a piece of code myself. It is very similar to the merge sort, but I have tested it and compared against all other approaches I tried.
Here is the piece of code. The idea is to first sort these two dictionaries (this is fast) and convert them into list of tuples, and then merge them. I hope somebody can beat me, but this is the fastest I've known so far. The result is a sorted tuple list, so if I want to merge more big dictionaries, I can just sort the new input and merge it into the existing tuple list.
def merge(dict_01,dict_02):
sorted_01 = sorted(dict_01.items(),key=operator.itemgetter(0))
sorted_02 = sorted(dict_02.items(),key=operator.itemgetter(0))
# merge them
l = 0
r = 0
while True:
if sorted_02[r][0] > sorted_01[l][0]:
l += 1
# if l reaches the end, append all left on sorted_02 to sorted_01 and break
if l == len(sorted_01):
sorted_01 += sorted_02[r:]
break
elif sorted_02[r][0] < sorted_01[l][0]:
sorted_01.insert(l,sorted_02[r])
r += 1
l += 1
# if r reached the end, merging is done, break
if r == len(sorted_02):
break
# if l reaches the end, append all left on sorted_02 to sorted_01 and break
if l == len(sorted_01):
sorted_01 += sorted_02[r:]
break
else:
r += 1
if r == len(sorted_02):
break
return sorted_01

- 2,623
- 7
- 36
- 38
Try this:
x = {"md5":("value1","value2")}
y = {"md7":("value1","value2")}
z = dict(x.items() + y.items())
print z

- 22,092
- 8
- 59
- 93

- 2,999
- 3
- 17
- 20