1

I have two big dictionaries. Each of them looks like this {"md5":(value1,value2)}. Each of them of contains up to 10 million elements (I used dict.fromkeys() to generate them). But now I want to merge them into a single one to eliminate duplicates. What is the fastest way to do so? Can I parallelize it? It seems this is a CPU bounded problem because if I use dict.update(), one core is always 100% used, while the other cores are idle.

Saksham Varma
  • 2,122
  • 13
  • 15
fanchyna
  • 2,623
  • 7
  • 36
  • 38
  • 1
    Thank you. I know dictionary is thread safe in terms of modifying values, but there are two challenges: the large size and the shared memory. None of your posts addresses these challenges. – fanchyna May 05 '15 at 18:34

3 Answers3

0

This answer might be "cheating", but if you use a ChainMap, this merging (just creating the ChainMap) is O(1). As fast as it gets. Instead of working with a dict, you work with a dict-like object.

The tradeoff is that each lookup in the resulting collection uses 1 or 2 lookups in the underlying dicts. If this overhead is acceptable in your case, ChainMap is the answer.

Community
  • 1
  • 1
shx2
  • 61,779
  • 13
  • 130
  • 153
  • Thank you, but unfortunately, I am only allowed to use Python2.6. ChainMap is introduced in Python3. – fanchyna May 05 '15 at 18:32
  • @fanchyna see this recipe: http://code.activestate.com/recipes/305268-chained-map-lookups/ – shx2 May 05 '15 at 19:16
0

Thank you for all answers. Actually, I found the answer by writing a piece of code myself. It is very similar to the merge sort, but I have tested it and compared against all other approaches I tried.

Here is the piece of code. The idea is to first sort these two dictionaries (this is fast) and convert them into list of tuples, and then merge them. I hope somebody can beat me, but this is the fastest I've known so far. The result is a sorted tuple list, so if I want to merge more big dictionaries, I can just sort the new input and merge it into the existing tuple list.

def merge(dict_01,dict_02):
    sorted_01 = sorted(dict_01.items(),key=operator.itemgetter(0))
    sorted_02 = sorted(dict_02.items(),key=operator.itemgetter(0))

    # merge them
    l = 0
    r = 0
    while True:
        if sorted_02[r][0] > sorted_01[l][0]:
            l += 1
            # if l reaches the end, append all left on sorted_02 to sorted_01 and break
            if l == len(sorted_01):
                 sorted_01 += sorted_02[r:]
                 break
        elif sorted_02[r][0] < sorted_01[l][0]:
            sorted_01.insert(l,sorted_02[r])
            r += 1
            l += 1
            # if r reached the end, merging is done, break
            if r == len(sorted_02):
                break
            # if l reaches the end, append all left on sorted_02 to sorted_01 and break
            if l == len(sorted_01):
                sorted_01 += sorted_02[r:]
                break
        else:
            r += 1
            if r == len(sorted_02):
                break


   return sorted_01
fanchyna
  • 2,623
  • 7
  • 36
  • 38
-1

Try this:

x = {"md5":("value1","value2")}
y = {"md7":("value1","value2")}
z = dict(x.items() + y.items())
print z
Łukasz Rogalski
  • 22,092
  • 8
  • 59
  • 93
fahad
  • 2,999
  • 3
  • 17
  • 20