16

If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.

For example, if I have the following sets:

A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}

I want to be able to find:

intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}

Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:

intersections_C={"a": {"A", "C"},
                 "c": {"A", "B", "C"}
                 "e": {"B", "C"}}

I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs, and then iterate through all of those values to create lists such as intersections_C above, but I'm not sure how that scales as n increases and the sets themselves become very large.
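In rough (untested) Python, that approach would look something like this; value_to_sets is just a name I made up for illustration:

from collections import defaultdict

def value_to_sets(sets_by_name):
    # invert the input: map each value to the names of the sets that contain it
    index = defaultdict(set)
    for name, s in sets_by_name.items():
        for value in s:
            index[value].add(name)
    # keep only the values that appear in at least two sets
    return {v: names for v, names in index.items() if len(names) > 1}

The pairwise intersections can then be read off this index, but it takes O(n*m) extra space to hold the index itself.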

Some additional background information:

  1. Each of the sets is of roughly the same size, but they are also very large (large enough that storing them all in memory is a realistic concern; an algorithm that avoids doing so would be preferred, though it is not necessary)
  2. The size of the intersections between any two sets is very small compared to the size of the sets themselves
  3. If it helps, we can assume anything we need to about the ordering of the input sets.
ankushg
  • Have you tried the method that you know works? – Simeon Visser Dec 09 '14 at 00:14
  • I would suggest the following: Traverse all sets and build a map by tracking where you find each element. This is O(NlogN) (assuming that the dictionary adds a logarithmic overhead), where N is the total number of elements. – nickie Dec 09 '14 at 00:20
  • I've tried the method I described on small samples, but the problem is that a lot of the data I will be using with this is user-fed. I would ideally like to be able to support much larger use cases, so I was wondering if there's a more common/efficient way to do this than the naive approach I described. – ankushg Dec 09 '14 at 00:21
  • @nickie Is your idea to traverse the sets and make a dictionary independently for all n sets, making the dictionary only of size m per iteration rather than n*m to store all possible elements? – ankushg Dec 09 '14 at 00:27
  • I think this can be done in linear time using a hash table, linear with respect to the size of the sets: O(N + M + N * c), where c is a constant that represents the cost of accessing an entry in the hash table; this constant will be proportional to the length of the strings in your sets. – rendon Dec 09 '14 at 00:32
  • It can be done in constant time (assuming set subtraction is constant time) – tox123 Dec 09 '14 at 00:33
  • I suggested using a single dictionary for all sets, which would finally contain all elements and would therefore be of size O(n*m) if you have n sets of roughly the same size m. But you'll have to provide more information if you want a more educated guess. For example, is it just the n*(n-1)/2 intersections that you're interested in? Can you afford an extra O(n*m) space? If yes, you can calculate them all in O(n*m*log(n*m)) time, assuming that your intersections are small; worst case O(n*n*m*log(n*m)). – nickie Dec 09 '14 at 01:03
  • @nickie: Is there a difference between your suggestion and what I mentioned as a possible strategy below `intersections_C` in my post? The only information that I care about at the end are the intersections of the sets. Unfortunately, I don't know too much about what my constraints are because I haven't been able to test this to scale just yet. Thanks for your help! – ankushg Dec 09 '14 at 01:08
  • I assumed that your `intersections_C` dictionary contains only the elements that belong in set C. I suggested one dictionary for the elements contained in the union of all your sets. – nickie Dec 09 '14 at 01:10
  • Can we assume anything about the ordering of the set elements themselves? – tzaman Dec 09 '14 at 01:13
  • @nickie: Ah ok, yeah I should have clarified my approach better -- I meant to create a dictionary for the union of all the sets, iterate through each set to populate the dictionary with which sets contain each element, and then iterate through the dictionary once more to create dictionaries such as `intersections_C`. – ankushg Dec 09 '14 at 01:23
  • @tzaman: If it helps, we can assume that the keys are sorted – ankushg Dec 09 '14 at 01:23
  • **1.** Approximately how large is `n` in practice? **2.** If you were to take all of your sets and partition them randomly into two groups `A` and `B`, could `(⋃A) ⋂ (⋃B)` be large? In Python syntax, that's `set().union(*A) & set().union(*B)`. – Veedrac Dec 10 '14 at 02:37

3 Answers

8

this ought to do what you want

import random as RND
import string
import itertools as IT

mock some data

fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for c in range(5)]

generate an index list of the sets in S so the sets can be referenced more concisely below

idx = range(len(S))

get all possible unique pairs of the items in S; however, since set intersection is commutative, we want the combinations rather than permutations

pairs = IT.combinations(idx, 2)

write a function to perform the set intersection

nt = lambda a, b: S[a].intersection(S[b])

fold this function over the pairs & key the result from each function call to its arguments

res = dict([ (t, nt(*t)) for t in pairs ])

the result below, formatted per the first option recited in the OP, is a dictionary in which the values are the set intersections of two sequences; each value is keyed to a tuple comprising the indices of those two sequences
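for example, res might look something like this; illustrative only, since the actual values depend on the random sample:

{(0, 1): {'K'}, (0, 2): {'K', 'W'}, (0, 3): set(), (0, 4): {'B', 'Q'}, (1, 2): {'C'}, (1, 3): {'D', 'W'}, (1, 4): set(), (2, 3): {'M'}, (2, 4): {'K'}, (3, 4): {'A', 'R'}}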

this solution is really just two lines of code: (i) calculate the combinations; (ii) then apply some function over each combination, storing the returned value in a key-value container

the memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, ie

res = ( (t, nt(*t)) for t in pairs )

notice that with this approach, neither the sequence of pairs nor the corresponding intersections is ever written out in memory; ie, both pairs and res are iterators.
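you can then consume it lazily, eg by printing just the non-empty intersections:

for pair, common in res:
    if common:
        print(pair, common)

at each step only the current pair and its intersection exist in memory.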

doug
  • This takes O(n*n*m) time, if you have n sets of size m. – nickie Dec 09 '14 at 01:16
  • the time complexity for intersection of two sets x and y is O(len(x) * len(y)); unfavorable time complexity is inherent in this problem, so the best you can do is not make it worse and just address the constant factors (eg, don't re-implement the underlying C loops; instead use the optimized python functions, like list comprehensions). – doug Dec 09 '14 at 01:24
  • This definitely gets the job done, but it would be super cool if there was a way to use this simple syntax with a much more memory-efficient approach like [tzaman described](http://stackoverflow.com/a/27370005/152514). Thanks! – ankushg Dec 09 '14 at 07:54
  • have you looked at the memory profile for this solution? nothing is written out in memory here. the pairs object is an iterator (python3) w/ just the set indices, not the sets themselves; likewise, see my amended answer: the last line can be changed slightly to return a generator expression (vs. a dict), holding tuples rather than k:v pairs – doug Dec 09 '14 at 08:19
  • Ah nice! One more thing that I noticed is that we could cut the number of intersection calculations in half by using combinations rather than permutations of the sets. – ankushg Dec 09 '14 at 08:26
  • Why aren't you using a dictionary comprehension (aka. `res = {t: nt(*t) for t in pairs}`)? – Veedrac Dec 10 '14 at 02:14
  • @doug The time complexity for the intersection of two sets is `O(min(len x, len y))`, not `O(len x · len y)`. Further, one should be able to reduce that cost significantly if the intersections are known to be small and intermediates can be re-used. – Veedrac Dec 10 '14 at 02:33
3

If we can assume that the input sets are ordered, a pseudo-mergesort approach seems promising. Treating each set as a sorted stream, advance the streams in parallel, always advancing only those whose current value is lowest among all the iterators. Compare each current value with the new minimum every time an iterator is advanced, and dump the matches into your same-item collections.
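Here is a rough sketch of that idea; illustrative only, assuming each set can be read as a sorted stream of mutually comparable values, with heapq.merge doing the parallel advancing:

import heapq

def tagged(stream, name):
    # label each value with the name of the set it came from
    for value in stream:
        yield value, name

def streaming_intersections(named_streams):
    # named_streams: (name, sorted iterable) pairs; yields (value, names)
    # for every value that occurs in two or more of the input sets.
    # Only one element per stream needs to be in memory at a time.
    merged = heapq.merge(*(tagged(s, n) for n, s in named_streams))
    current, owners = object(), set()
    for value, name in merged:
        if value == current:
            owners.add(name)
        else:
            if len(owners) > 1:
                yield current, owners
            current, owners = value, {name}
    if len(owners) > 1:
        yield current, owners

For the sets in the question this yields ("a", {"A", "C"}), ("c", {"A", "B", "C"}) and ("e", {"B", "C"}), and each result can be written out to a file as it is produced instead of being accumulated in memory.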

tzaman
  • This was the next thing I was considering -- the dictionary idea (where the dictionary is keyed on the union of all *n* sets) seemed easier to conceptualize and implement, but I felt that intuitively this would save some time and memory consumption. Any idea how to quantify the savings that this approach has over the dictionary method? – ankushg Dec 09 '14 at 03:25
  • The primary advantage of a streaming approach is that you only need to have one item from each set in memory at a time. The dictionary approach uses far more than that: ~O(number of unique elements * average membership). You can even write out the intersections themselves to a set of files if necessary. – tzaman Dec 09 '14 at 03:49
-4

How about using the intersection method of set? See below:

A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}

intersect_AB = A.intersection(B)
intersect_BC = B.intersection(C)
intersect_AC = A.intersection(C)

print(intersect_AB, intersect_BC, intersect_AC)
Ajit Vaze
  • 1
    The example that I gave was intended to be a generalized example (I'll have many more than just sets A, B, and C), and I wanted to avoid redoing work if possible because the size of my sets can potentially be enormous. – ankushg Dec 09 '14 at 05:33