1

I have two dictionaries mapping IDs to values. For simplicity, lets say those are the dictionaries:

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

As named, the dictionaries are not symmetrical. I would like to get a dictionary of keys from dictionaries d_source and d_target whose values match. The resulting dictionary would have d_source keys as its own keys, and d_target keys as that keys value (in either a list, tuple or set format).

This would be The expected returned value for the above example should be the following list:

{'a': ('1', 'A'),
 'b': ('B',),
 'c': ('C',),
 '3': ('C',)}

There are two somewhat similar questions, but those solutions can't be easily applied to my question.

Some characteristics of the data:

  1. Source would usually be smaller than target. Having roughly few thousand sources (tops) and a magnitude more targets.
  2. Duplicates in the same dict (both d_source and d_target) are not too likely on values.
  3. matches are expected to be found for (a rough estimate) not more than 50% than d_source items.
  4. All keys are integers.

What is the best (performance wise) solution to this problem? Modeling data into other datatypes for improved performance is totally ok, even when using third party libraries (i'm thinking numpy)

Community
  • 1
  • 1
NirIzr
  • 3,131
  • 2
  • 30
  • 49
  • 4
    What if there are more than one key with same value? Does all the value from dict should be the part of tuple? – Moinuddin Quadri Sep 14 '16 at 20:49
  • Yes, I'll add that to my question. – NirIzr Sep 14 '16 at 20:50
  • Are you sure there are no repeated values inside each dictionary? Does the order within each tuple or within the final list matter? – Rory Daulton Sep 14 '16 at 20:51
  • I've updated my question quite a bit to address those issues. I did not notice them before. The best solution will actually be a more complex solution. – NirIzr Sep 14 '16 at 20:58
  • If nothing maps to the value for a key in `d_source`, do you want a) an empty tuple as the value, or b) for the key not to be present at all in the output ? – wim Sep 14 '16 at 21:18
  • b) is preferred, but a) is also ok. – NirIzr Sep 14 '16 at 21:19
  • What are the typical sizes of target and source? Roughly how often are there multiple hits in the target and how often are there non-matched source keys? I'm preparing to do some timing runs unless you plan to :) – Craig Burgler Sep 14 '16 at 21:26
  • @CraigBurgler Edited with answers. Thanks a lot! timings would be awesome. – NirIzr Sep 14 '16 at 21:38
  • So the data characteristics provided are inconsistent, i.e. you cannot have 10x more targets than source, not likely (10%) value duplicates in both target and source, and 50% source to target match. For the most accurate timings you should run the various working methods against your actual data. – Craig Burgler Sep 14 '16 at 22:08
  • 1
    BTW if I don't win the timing race I was focusing more on readability :) – Craig Burgler Sep 14 '16 at 22:11
  • @CraigBurgler "source to target" are out of # of sources. There would be a lot of "unmatched" targets but I don't understand why there can't be ±50% matched sources. Also, 50% is a top boundary. RE: timing race - naturally :D – NirIzr Sep 14 '16 at 22:14
  • you are correct, timings coming up. I would be interested to see a numpy version as well. – Craig Burgler Sep 14 '16 at 22:18
  • sorry had to talk to the super_vet; results added to my answer – Craig Burgler Sep 14 '16 at 22:58

5 Answers5

2

All answers have O(n^2) efficiency which isn't very good so I thought of answering myself.

I use 2(source_len) + 2(dict_count)(dict_len) memory and I have O(2n) efficiency which is the best you can get here I believe.

Here you go:

from collections import defaultdict

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

def merge_dicts(source_dict, *rest):
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

new_dict = merge_dicts(d_source, d_target)

By the way, I'm using a tuple in order not to link the resulting lists together.


As you've added specifications for the data, here's a closer matching solution:

d_source = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

def second_merge_dicts(source_dict, *rest):
    """Optimized for ~50% source match due to if statement addition.

    Also uses less memory.
    """
    unique_values = set(source_dict.values())
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            if v in unique_values:
                flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

new_dict = second_merge_dicts(d_source, d_target)
Bharel
  • 23,672
  • 5
  • 40
  • 80
  • Your output is bogus. Don't pop from the target or else you will get `KeyError` for dupes. – wim Sep 14 '16 at 21:26
  • @wim Aye, thought it will be more efficient but forgot it also repeats there. Oh well, still as efficient as it can get probably. – Bharel Sep 14 '16 at 21:27
  • I get the same output that @Bharel posted when I run the code – Craig Burgler Sep 14 '16 at 21:28
  • This is quite good now, I just have some suggestions: 1) add `if v in flipped_target` to the dict comprehension 2) change the list comprehension to a plain old for loop. – wim Sep 14 '16 at 21:29
  • @Bharel theoretical efficiency is great, but my main focus here is practical efficiency. See (another) edit for some expected numbers, and thanks for your answers so far! – NirIzr Sep 14 '16 at 21:40
  • 2
    @NirIzr optimized for 50% source match. – Bharel Sep 14 '16 at 21:58
1
from collections import defaultdict
from pprint import pprint

d_source  = {'a': 1, 'b': 2, 'c': 3, '3': 3}
d_target = {'A': 1, 'B': 2, 'C': 3, '1': 1}

d_result = defaultdict(list)
{d_result[a].append(b) for a in d_source for b in d_target if d_source[a] == d_target[b]}

pprint(d_result)

Output:

{'3': ['C'],
 'a': ['A', '1'],
 'b': ['B'],
 'c': ['C']}

Timing results:

from collections import defaultdict
from copy import deepcopy
from random import randint
from timeit import timeit


def Craig_match(source, target):
    result = defaultdict(list)
    {result[a].append(b) for a in source for b in target if source[a] == target[b]}
    return result

def Bharel_match(source_dict, *rest):
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

def modified_Bharel_match(source_dict, *rest):
    """Optimized for ~50% source match due to if statement addition.

    Also uses less memory.
    """
    unique_values = set(source_dict.values())
    flipped_rest = defaultdict(list)
    for d in rest:
        while d:
            k, v = d.popitem()
            if v in unique_values:
                flipped_rest[v].append(k)
    return {k: tuple(flipped_rest.get(v, ())) for k, v in source_dict.items()}

# generate source, target such that:
# a) ~10% duplicate values in source and target
# b) 2000 unique source keys, 20000 unique target keys
# c) a little less than 50% matches source value to target value
# d) numeric keys and values
source = {}
for k in range(2000):
    source[k] = randint(0, 1800)
target = {}
for k in range(20000):
    if k < 1000:
        target[k] = randint(0, 2000)
    else:
        target[k] = randint(2000, 19000)

best_time = {}
approaches = ('Craig', 'Bharel', 'modified_Bharel')
for a in approaches:
    best_time[a] = None

for _ in range(3):
    for approach in approaches:
        test_source = deepcopy(source)
        test_target = deepcopy(target)

        statement = 'd=' + approach + '_match(test_source,test_target)'
        setup = 'from __main__ import test_source, test_target, ' + approach + '_match'
        t = timeit(stmt=statement, setup=setup, number=1)
        if not best_time[approach] or (t < best_time[approach]):
            best_time[approach] = t

for approach in approaches:
    print(approach, ':', '%0.5f' % best_time[approach])

Output:

Craig : 7.29259
Bharel : 0.01587
modified_Bharel : 0.00682
Craig Burgler
  • 1,749
  • 10
  • 19
  • Could you please add Bharel's second match for completeness's sake? – NirIzr Sep 14 '16 at 23:03
  • 1
    I accepted this answer because although it did not eventually provide the fastest solution, it went the extra mile in providing me with the information I needed. Thanks! – NirIzr Sep 15 '16 at 00:02
1

Here is another solution. There are a lot of ways to do this

for key1 in d1:
    for key2 in d2:
        if d1[key1] == d2[key2]:
            stuff

Note that you can use any name for key1 and key2.

Steven Walton
  • 505
  • 6
  • 20
1

This maybe "cheating" in some regards, although if you are looking for the matching values of the keys regardless of the case sensitivity then you might be able to do:

import sets

aa = {'a': 1, 'b': 2, 'c':3}
bb = {'A': 1, 'B': 2, 'd': 3}

bbl = {k.lower():v for k,v in bb.items()}

result = {k:k.upper() for k,v in aa.iteritems() & bbl.viewitems()}
print( result )

Output:

{'a': 'A', 'b': 'B'}

The bbl declaration changes the bb keys into lowercase (it could be either aa, or bb).

* I only tested this on my phone, so just throwing this idea out there I suppose... Also, you've changed your question radically since I began composing my answer, so you get what you get.

l'L'l
  • 44,951
  • 10
  • 95
  • 146
0

It is up to you to determine the best solution. Here is a solution:

def dicts_to_tuples(*dicts):
    result = {}
    for d in dicts:
        for k,v in d.items():
            result.setdefault(v, []).append(k)
    return [tuple(v) for v in result.values() if len(v) > 1]

d1 = {'a': 1, 'b': 2, 'c':3}
d2 = {'A': 1, 'B': 2}
print dicts_to_tuples(d1, d2)
Robᵩ
  • 163,533
  • 20
  • 239
  • 308