102

I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short documents, with ID numbers as keys and their text content as values.

To perform an 'AND' search for two terms, I thus need to intersect their postings lists (dictionaries). What is a clear (not necessarily overly clever) way to do this in Python? I started out by trying it the long way with iter:

p1 = index[term1]  
p2 = index[term2]
i1 = iter(p1)
i2 = iter(p2)
while ...  # not sure of the 'iter != end 'syntax in this case
...
martineau
  • 119,623
  • 25
  • 170
  • 301
norman
  • 5,128
  • 13
  • 44
  • 75
  • {i:dict(p1[i],**p2[i]) for i in p1 if i in p2} – mtadd Sep 01 '13 at 00:27
  • My above comment will intersect your term dictionaries, but union-merge your posting lists....if you also want to intersect your posting lists on your document ID numbers, you can use `{term:{doc_id:p1[term][doc_id] for doc_id in p1[term] if doc_id in p2[term]} for term in p1 if term in p2}` – mtadd Sep 01 '13 at 00:42
  • 4
    It's awfully unclear what your desired output is (or even what your input is). Judging from your description of the problem, it sounds like you have nested dictionaries that you want to... "intersect", whatever that means. But the answer you accepted not only does not intersect nested dictionaries, it doesn't even intersect dictionaries. All it does is intersect the *keys* of two dictionaries. Please clarify your question and update it with a [mcve]. – Aran-Fey Sep 16 '18 at 17:52
  • 1
    Possible duplicate of [Efficient dictionary key intersection](//stackoverflow.com/q/52356646) – Aran-Fey Sep 16 '18 at 18:10

10 Answers10

134

A little known fact is that you don't need to construct sets to do this:

Python 3

d1 = {'a': 1, 'b': 2}    
d2 = {'b': 2, 'c': 3}    
print(d1.keys() & d2.keys()) # {'b'}

Python 2

In Python 2, we replace keys with viewkeys. The same applies to values (viewvalues) and items(viewitems).

In [78]: d1 = {'a': 1, 'b': 2}

In [79]: d2 = {'b': 2, 'c': 3}

In [80]: d1.viewkeys() & d2.viewkeys()
Out[80]: {'b'}

From the documentation of viewitems:

In [113]: d1.viewitems??
Type:       builtin_function_or_method
String Form:<built-in method viewitems of dict object at 0x64a61b0>
Docstring:  D.viewitems() -> a set-like object providing a view on D's items

For larger dicts this also slightly faster than constructing sets and then intersecting them:

In [122]: d1 = {i: rand() for i in range(10000)}

In [123]: d2 = {i: rand() for i in range(10000)}

In [124]: timeit d1.viewkeys() & d2.viewkeys()
1000 loops, best of 3: 714 µs per loop

In [125]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2

1000 loops, best of 3: 805 µs per loop

For smaller `dict`s `set` construction is faster:

In [126]: d1 = {'a': 1, 'b': 2}

In [127]: d2 = {'b': 2, 'c': 3}

In [128]: timeit d1.viewkeys() & d2.viewkeys()
1000000 loops, best of 3: 591 ns per loop

In [129]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2

1000000 loops, best of 3: 477 ns per loop

We're comparing nanoseconds here, which may or may not matter to you. In any case, you get back a set, so using viewkeys/keys eliminates a bit of clutter.

nbro
  • 15,395
  • 32
  • 113
  • 196
Phillip Cloud
  • 24,919
  • 11
  • 68
  • 88
126

In general, to construct the intersection of dictionaries in Python, you can first use the & operator to calculate the intersection of sets of the dictionary keys (dictionary keys are set-like objects in Python 3):

dict_a = {"a": 1, "b": 2}
dict_b = {"a": 2, "c": 3} 

intersection = dict_a.keys() & dict_b.keys()  # {'a'}

On Python 2 you have to convert the dictionary keys to sets yourself:

keys_a = set(dict_a.keys())
keys_b = set(dict_b.keys())
intersection = keys_a & keys_b

Then given the intersection of the keys, you can then build the intersection of your values however is desired. You have to make a choice here, since the concept of set intersection doesn't tell you what to do if the associated values differ. (This is presumably why the & intersection operator is not defined directly for dictionaries in Python).

In this case it sounds like your values for the same key would be equal, so you can just choose the value from one of the dictionaries:

dict_of_dicts_a = {"a": {"x":1}, "b": {"y":3}}
dict_of_dicts_b = {"a": {"x":1}, "c": {"z":4}} 

shared_keys = dict_of_dicts_a.keys() & dict_of_dicts_b.keys()

# values equal so choose values from a:
dict_intersection = {k: dict_of_dicts_a[k] for k in shared_keys }  # {"a":{"x":1}}

Other reasonable methods of combining values would depend on the types of the values in your dictionaries, and what they represent. For example you might also want the union of values for shared keys of dictionaries of dictionaries. Since the union of dictionaries doesn't depend on the values, it is well defined, and in python you can get it using the | operator:

# union of values for each key in the intersection:
dict_intersection_2 = { k: dict_of_dicts_a[k] | dict_of_dicts_b[k] for k in shared_keys }

Which in this case, with identical dictionary values for key "a" in both, would be the same result.

James
  • 24,676
  • 13
  • 84
  • 130
  • As @Aran-Fey pointed out in a comment beneath the question, this finds the intersection of the keys of two dictionaries, which arguably isn't the same thing as the intersection of two dictionaries since it completely ignores the values associated with the keys. – martineau Mar 28 '21 at 19:06
  • @martineau The concept of intersecting dictionary values is not very well defined - you might legitimately want to do different things with the values for intersecting keys if the values differ. I added some examples explaining this. – James Mar 30 '21 at 13:04
  • 1
    Your updates are an improvement IMO. FWIW one of the most intuitive (and succinct) approaches of determining the intersection that takes the values into consideration that I've ever seen is user @schwobaseggl's [answer](https://stackoverflow.com/a/48246160/355230) to a similar question. – martineau Mar 30 '21 at 15:14
93
In [1]: d1 = {'a':1, 'b':4, 'f':3}

In [2]: d2 = {'a':1, 'b':4, 'd':2}

In [3]: d = {x:d1[x] for x in d1 if x in d2}

In [4]: d
Out[4]: {'a': 1, 'b': 4}
emnoor
  • 2,528
  • 2
  • 18
  • 15
  • 19
    This should be the answer as this is the only one that shows how to get an intersection dict, not lists of keys, in a simple way. – Rafe Jun 28 '14 at 00:38
32

In Python 3, you can use

intersection = dict(dict1.items() & dict2.items())
union = dict(dict1.items() | dict2.items())
difference = dict(dict1.items() ^ dict2.items())
dccsillag
  • 919
  • 4
  • 14
  • 25
  • 4
    For `dict1 = {1: 1, 2:2}` and `dict2 = {2:4, 3:3}` the `intersection == set()`. Probably this is not what OP wants. – WloHu Jan 17 '19 at 09:26
  • @JCode I'm not sure I understand. `intersection = dict(...)` converts it back to a `dict`. Furthermore, I just tested it with `dict1 = {1: 1, 2: 2}` and `dict2 = {2: 4, 3: 3}` (the dictionaries above), and `intersection == set()` is `False`. It is an empty dictionary. – dccsillag Jan 17 '19 at 13:34
  • @DCPY Sorry, I missed that `dict(...)`. The point is that given my example the result is empty. Considering what OP accepted and more common use cases intersection of `.items()` makes little sense unless you're looking for literal duplicates. – WloHu Jan 17 '19 at 14:27
  • @JCode In that case, it is safe to assume that all values are unique. Then, instead of `dict1`, use `{v: k for k, v in dict1.items()}` and instead of `dict2` use `{v: k for k, v in dict2.items()}`. And, if you want the case where it's either the key _or_ the dictionary, then use the union of the intersection given in the answer and the intersection given in this comment. – dccsillag Jan 17 '19 at 14:36
  • This does not work for any type of value in dict().values() – Kots Feb 16 '21 at 13:20
3

None of the solutions so far solve the general case of intersecting N dictionaries.

So, if you want to handle the intersection of N arbitrary dictionaries:

from functools import reduce

def dict_intersection(*dict_list):
    return reduce(lambda a,b: dict(a.items() & b.items()), dict_list)

a = {k:k for k in range(0,5)} # {0: 0, 1: 1, 2: 2, 3: 3, 4: 4}
b = {k:k for k in range(2,7)} # {2: 2, 3: 3, 4: 4, 5: 5, 6: 6}
c = {k:k for k in range(3,8)} # {3: 3, 4: 4, 5: 5, 6: 6, 7: 7}

dict_intersection(a,b,c)  # {3:3, 4:4}
# or if you have a list of dicts
dicts = [{k:k for k in range(0+n,5+n)} for n in (0,2,3)] # == [a,b,c]
dict_intersection(*dicts) # {3:3, 4:4}

Using functools.reduce allows for completing the operation within a single iteration over the list of dictionaries instead of the multiple loops in some solutions. It also doesn't perform any additional conditional statements.

Trade-offs

Changing dict_intersection_v1 to dict_intersection_v2 we can see it performs faster for a larger lists of dictionaries and/or dictionaries (designing a proper experiment to test out which is a larger factor is outside the scope of this solution). This performance gain is due to reducing the amount of dictionary instantiations.

def dict_intersection_v1(*dict_list):
    return reduce(lambda a,b: dict(a.items() & b.items()),  dict_list)

def dict_intersection_v2(*dict_list):
    return dict(reduce(lambda a,b: a & b, (d.items() for d in dict_list)))

dict_lst1 = [{k:k for k in range(0+n,5+n)} for n in (0,2,3)] # = [a,b,c]
dict_lst2 = [{k:k for k in range(0,50,n)} for n in range(1,5)]]
dict_lst3 = [{k:k for k in range(0,500,n)} for n in range(40)]
dict_lst4 = [{k:k for k in range(0+n,500+n)} for n in range(400)]
dict list kv pair count dict_intersection_v1 dict_intersection_v2 relative difference
1 15 808 ns ± 4.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 821 ns ± 0.785 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) + 1.6%
2 105 3.14 µs ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 2.38 µs ± 5.76 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) -24.2%
3 2155 36.9 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) 25.1 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) -32.0%
4 200_000 9.08 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 4.88 ms ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) -46.3%

The regression for result dict_lst1 is mainly due to difference in overhead between creating a dictionary after every intersection and the overhead due to dict.items() calls within the generator (and python's general function call overhead).

NB: I did test using the pre-calculated list of dict.items() for the for a dictionary instead of v2's building the generator on the fly.

I tested both passing in the pre-calculated list outside of the timings and within the timings, and, while it's statistically significant, it's less than 30 μs and 10 μs respectively. If you are trying to get these gains looking at a different language or Cython might be better.

plunker
  • 1,190
  • 2
  • 14
2

Okay, here is a generalized version of code above in Python3. It is optimized to use comprehensions and set-like dict views which are fast enough.

Function intersects arbitrary many dicts and returns a dict with common keys and a set of common values for each common key:

def dict_intersect(*dicts):
    comm_keys = dicts[0].keys()
    for d in dicts[1:]:
        # intersect keys first
        comm_keys &= d.keys()
    # then build a result dict with nested comprehension
    result = {key:{d[key] for d in dicts} for key in comm_keys}
    return result

Usage example:

a = {1: 'ba', 2: 'boon', 3: 'spam', 4:'eggs'}
b = {1: 'ham', 2:'baboon', 3: 'sausages'}
c = {1: 'more eggs', 3: 'cabbage'}

res = dict_intersect(a, b, c)
# Here is res (the order of values may vary) :
# {1: {'ham', 'more eggs', 'ba'}, 3: {'spam', 'sausages', 'cabbage'}}

Here the dict values must be hashable, if they aren't you could simply change set parentheses { } to list [ ]:

result = {key:[d[key] for d in dicts] for key in comm_keys}
thodnev
  • 1,564
  • 16
  • 20
  • i am passing a list of dict to the function but it is giving error. how can i edit above function so that alist of dict is passed, and key:value pair with common keys as well as value is obtained? – learnningprogramming Mar 31 '17 at 21:32
  • 1
    @learnningprogramming, hope you've already figured out how to solve the problem but for others curious: `*dicts` as function arguments means you need to pass numerous arguments, not a list of them. If you have `lst = [dict1, dict2, dict3, ...]` either use `dict_intersect(dict1, dict2, dict3, ...)` or unpack the list `dict_intersect(*lst)` – thodnev May 07 '17 at 23:24
2

To find full intersection by keys and values

d1 = {'a':1}
d2 = {'b':2, 'a':1}
{x:d1[x] for x in d1 if x in d2 and d1[x] == d2[x]}

>> {'a':1}
soldovskij
  • 141
  • 1
  • 7
1

Your question isn't precise enough to give single answer.

1. Key Intersection

If you want to intersect IDs from posts (credits to James) do:

common_ids = p1.keys() & p2.keys()

However if you want to iterate documents you have to consider which post has a priority, I assume it's p1. To iterate documents for common_ids, collections.ChainMap will be most useful:

from collections import ChainMap
intersection = {id: document
                for id, document in ChainMap(p1, p2)
                if id in common_ids}
for id, document in intersection:
    ...

Or if you don't want to create separate intersection dictionary:

from collections import ChainMap
posts = ChainMap(p1, p2)
for id in common_ids:
    document = posts[id]

2. Items Intersection

If you want to intersect items of both posts, which means to match IDs and documents, use code below (credits to DCPY). However this is only useful if you're looking for duplicates in terms.

duplicates = dict(p1.items() & p2.items())
for id, document in duplicates:
    ...

3. Iterate over p1 'AND' p2.

In case when by "'AND' search" and using iter you meant to search both posts then again collections.ChainMap is the best to iterate over (almost) all items in multiple posts:

from collections import ChainMap
for id, document in ChainMap(p1, p2):
    ...
WloHu
  • 1,369
  • 17
  • 24
0

Just wrap the dictionary instances with a simple class that gets both of the values you want

class DictionaryIntersection(object):
    def __init__(self,dictA,dictB):
        self.dictA = dictA
        self.dictB = dictB

    def __getitem__(self,attr):
        if attr not in self.dictA or attr not in self.dictB:
            raise KeyError('Not in both dictionaries,key: %s' % attr)

        return self.dictA[attr],self.dictB[attr]

x = {'foo' : 5, 'bar' :6}
y = {'bar' : 'meow' , 'qux' : 8}

z = DictionaryIntersection(x,y)

print z['bar']
Eric Urban
  • 3,671
  • 1
  • 18
  • 23
0
def two_keys(term_a, term_b, index):
    doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
    doc_store = index[term_a] # index[term_b] would work also
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

def n_keys(terms, index):
    doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
    doc_store = index[term[0]]
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

In [0]: index = {'a': {1: 'a b'}, 
                 'b': {1: 'a b'}}

In [1]: two_keys('a','b', index)
Out[1]: {1: 'a b'}

In [2]: n_keys(['a','b'], index)
Out[2]: {1: 'a b'}

I would recommend changing your index from

index = {term: {doc_id: doc}}

to two indexes one for the terms and then a separate index to hold the values

term_index = {term: set([doc_id])}
doc_store = {doc_id: doc}

that way you don't store multiple copies of the same data

Aaron Goldman
  • 829
  • 9
  • 7