Merge two big dictionaries in parallel

Question

I have two big dictionaries. Each of them looks like this {"md5":(value1,value2)}. Each of them of contains up to 10 million elements (I used dict.fromkeys() to generate them). But now I want to merge them into a single one to eliminate duplicates. What is the fastest way to do so? Can I parallelize it? It seems this is a CPU bounded problem because if I use dict.update(), one core is always 100% used, while the other cores are idle.

Thank you. I know dictionary is thread safe in terms of modifying values, but there are two challenges: the large size and the shared memory. None of your posts addresses these challenges. — fanchyna, May 05 '15 at 18:34

score 0 · Answer 1 · edited May 23 '17 at 12:14

0

This answer might be "cheating", but if you use a ChainMap, this merging (just creating the ChainMap) is O(1). As fast as it gets. Instead of working with a dict, you work with a dict-like object.

The tradeoff is that each lookup in the resulting collection uses 1 or 2 lookups in the underlying dicts. If this overhead is acceptable in your case, ChainMap is the answer.

edited May 23 '17 at 12:14

Community

1
1

answered May 05 '15 at 18:20

shx2

61,779
13
130
153

Thank you, but unfortunately, I am only allowed to use Python2.6. ChainMap is introduced in Python3. – fanchyna May 05 '15 at 18:32
@fanchyna see this recipe: http://code.activestate.com/recipes/305268-chained-map-lookups/ – shx2 May 05 '15 at 19:16

score 0 · Answer 2 · answered May 06 '15 at 19:09

Thank you for all answers. Actually, I found the answer by writing a piece of code myself. It is very similar to the merge sort, but I have tested it and compared against all other approaches I tried.

Here is the piece of code. The idea is to first sort these two dictionaries (this is fast) and convert them into list of tuples, and then merge them. I hope somebody can beat me, but this is the fastest I've known so far. The result is a sorted tuple list, so if I want to merge more big dictionaries, I can just sort the new input and merge it into the existing tuple list.

def merge(dict_01,dict_02):
    sorted_01 = sorted(dict_01.items(),key=operator.itemgetter(0))
    sorted_02 = sorted(dict_02.items(),key=operator.itemgetter(0))

    # merge them
    l = 0
    r = 0
    while True:
        if sorted_02[r][0] > sorted_01[l][0]:
            l += 1
            # if l reaches the end, append all left on sorted_02 to sorted_01 and break
            if l == len(sorted_01):
                 sorted_01 += sorted_02[r:]
                 break
        elif sorted_02[r][0] < sorted_01[l][0]:
            sorted_01.insert(l,sorted_02[r])
            r += 1
            l += 1
            # if r reached the end, merging is done, break
            if r == len(sorted_02):
                break
            # if l reaches the end, append all left on sorted_02 to sorted_01 and break
            if l == len(sorted_01):
                sorted_01 += sorted_02[r:]
                break
        else:
            r += 1
            if r == len(sorted_02):
                break


   return sorted_01

score -1 · Answer 3 · edited May 06 '15 at 07:13

-1

Try this:

x = {"md5":("value1","value2")}
y = {"md7":("value1","value2")}
z = dict(x.items() + y.items())
print z

edited May 06 '15 at 07:13

Łukasz Rogalski

22,092
8
59
93

answered May 06 '15 at 06:31

fahad

2,999
3
17
20

Merge two big dictionaries in parallel

3 Answers3