I have two dictionaries, A and B, and both have the same keys 'a', 'b' and 'value'. All three values behind those keys are numpy arrays of the same size, but the size may differ between A and B. I found this link, but it only covers one-dimensional keys. One can view a combination a[0], b[0] as coordinates in Cartesian space and value[0] as their value. As an example:

A = {'a': numpy.array([1, 1, 9, 9]),
     'b': numpy.array([0, 1, 0, 1]),
     'value': numpy.array([1, 2, 3, 4])}
B = {'a': numpy.array([1, 1, 7, 7]),
     'b': numpy.array([0, 1, 0, 1]),
     'value': numpy.array([101, 102, 1003, 1004])}

I need to sum the values of those dictionaries: if a key combination exists in both, the values are added; otherwise the keys and values are appended. In the example, both dictionaries share the key combinations a:1, b:0 and a:1, b:1, so their values are added up: 1 + 101 = 102 and 2 + 102 = 104. The combinations a:9, b:0 and a:9, b:1 exist only in dictionary A; the combinations a:7, b:0 and a:7, b:1 exist only in dictionary B. So I want this result:

C = {'a': numpy.array([1, 1, 9, 9, 7, 7]),
     'b': numpy.array([0, 1, 0, 1, 0, 1]),
     'value': numpy.array([102, 104, 3, 4, 1003, 1004])}

I came up with a solution that takes dictionary A and modifies it by adding or appending data from dictionary B. It first generates one-dimensional hash keys from the two-dimensional key combinations in A and in B, then uses numpy.intersect1d() to find the common keys and adds B's values to A's values at those indices. Afterwards I take the inverse of the intersection and append both the uncommon keys and their values to dictionary A.

import numpy

def example(A, B):
    # generate 1-D hash keys (32-bit shift because the real values in a and b
    # are larger than in this example); materialize them into arrays, since in
    # Python 3 map() returns an iterator that can only be consumed once
    hash_A = numpy.fromiter(map(lambda a, b: (int(a) << 32) + int(b), A['a'], A['b']),
                            dtype=numpy.int64)
    hash_B = numpy.fromiter(map(lambda a, b: (int(a) << 32) + int(b), B['a'], B['b']),
                            dtype=numpy.int64)

    # the intersection is now one-dimensional and easy
    intersect = numpy.intersect1d(hash_A, hash_B)

    # add B's values to A's at the common keys
    A['value'][numpy.in1d(hash_A, intersect)] += B['value'][numpy.in1d(hash_B, intersect)]

    # append the keys (and values) that are only in B
    only_in_B = numpy.in1d(hash_B, intersect, invert=True)
    if only_in_B.any():
        A['a'] = numpy.append(A['a'], B['a'][only_in_B])
        A['b'] = numpy.append(A['b'], B['b'][only_in_B])
        A['value'] = numpy.append(A['value'], B['value'][only_in_B])

    return A

But my solution seems too slow to be useful, and I cannot think of a quicker way of getting there. The numpy arrays used have millions of entries, and this is done for several combinations of dictionaries, so speed is an issue. Any help would be appreciated.

Nras

1 Answer

I would start by changing the data structure to something like:

valuesA = {(A['a'][x], A['b'][x]): A['value'][x] for x in range(len(A['a']))}

That should give you:

{(1, 0): 1, (9, 0): 3, (1, 1): 2, (9, 1): 4}

Same for B:

valuesB = {(B['a'][x], B['b'][x]): B['value'][x] for x in range(len(B['a']))}
# {(1, 0): 101, (7, 0): 1003, (1, 1): 102, (7, 1): 1004}

Then merge valuesA into valuesB:

for key, value in valuesA.items():
    valuesB[key] = valuesB.get(key, 0) + value

Result is:

{(9, 0): 3, (7, 0): 1003, (9, 1): 4, (7, 1): 1004, (1, 0): 102, (1, 1): 104}

If you really need to, you can put it back into its original form (wrap the resulting lists in numpy.array if you need arrays again):

keys = valuesB.keys()
C = {'a': [x[0] for x in keys], 'b': [x[1] for x in keys], 'value': [valuesB[x] for x in keys]}

Result is

{'a': [9, 7, 9, 7, 1, 1], 
 'b': [0, 0, 1, 1, 0, 1], 
 'value': [3, 1003, 4, 1004, 102, 104]}

Note: if the order actually matters, you can consider using collections.OrderedDict instead of plain dicts; it preserves insertion order. (In Python 3.7+, plain dicts preserve insertion order as well.)
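Putting the steps above together, here is a minimal sketch of the whole merge as one function (the name merge_dicts is mine, not part of the answer), converting back to numpy arrays to match the input layout:

```python
import numpy

def merge_dicts(A, B):
    # build {(a, b): value} lookups for both datasets
    valuesA = {(A['a'][x], A['b'][x]): A['value'][x] for x in range(len(A['a']))}
    valuesB = {(B['a'][x], B['b'][x]): B['value'][x] for x in range(len(B['a']))}

    # merge A into B: sum values for shared keys, keep unique keys as-is
    for key, value in valuesA.items():
        valuesB[key] = valuesB.get(key, 0) + value

    # convert back to the column layout of the input dictionaries
    keys = list(valuesB.keys())
    return {'a': numpy.array([k[0] for k in keys]),
            'b': numpy.array([k[1] for k in keys]),
            'value': numpy.array([valuesB[k] for k in keys])}
```

With the A and B from the question, merge_dicts(A, B) contains the same key/value pairs as the desired C, possibly in a different order.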

njzk2
  • The order doesn't matter. No idea why I didn't think of tuples as keys. I will test your answer for speed. – Nras Jul 31 '14 at 13:41
  • If I put your code inside a method I get roughly the same time, just insignificantly better. But if I keep the data for A stored in that format, transform the values for B inside the method, and omit reformatting the result back to the original form, it is roughly 30% faster. That is nice, and I will ultimately accept your answer if there are no other suggestions. – Nras Jul 31 '14 at 15:30
  • One optimization would be to make sure you do the loop merge on the smaller of [valuesA, valuesB]. – njzk2 Jul 31 '14 at 15:40
  • I'm doing that by default. I realized that when I made up this simple example, I made it too simple: actually I am not doing this for only the one key 'value'; I also have two more keys (let's call them 'y' and 'z') whose values I want to sum up in the same manner. In my solution I could just use two more keys where I add the values; in your solution I struggle to do so. Any advice? – Nras Aug 01 '14 at 07:27
  • Sorry, my comment was a bit premature. Of course I simply use the same tuple as key, and the value is a numpy array with the three values. Everything works fine! – Nras Aug 01 '14 at 08:32
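Following up on the last two comments, one way to sum several fields at once is to keep the (a, b) tuple as the key and store the summable fields ('value' plus the 'y' and 'z' mentioned in the comments) as one small numpy array, so a single addition handles all of them. A sketch under those assumptions (the function name merge_multi is mine):

```python
import numpy

def merge_multi(A, B, fields=('value', 'y', 'z')):
    # pack the summable fields into one small array per (a, b) key
    def pack(D):
        return {(D['a'][x], D['b'][x]): numpy.array([D[f][x] for f in fields])
                for x in range(len(D['a']))}

    merged = pack(B)
    for key, vec in pack(A).items():
        # elementwise sum covers all fields at once; 0 broadcasts for new keys
        merged[key] = merged.get(key, 0) + vec

    # unpack back into the column layout of the inputs
    keys = list(merged.keys())
    C = {'a': numpy.array([k[0] for k in keys]),
         'b': numpy.array([k[1] for k in keys])}
    for i, f in enumerate(fields):
        C[f] = numpy.array([merged[k][i] for k in keys])
    return C
```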