
Randomly select two sets; both sets contain distinct keys (one key may belong to multiple sets, but a set never contains duplicate keys).

Return an integer representing the number of keys that belong to both sets.

For example, intersect({1,2,3,4}, {3,4,5}) returns 2.

I just need the size of the intersection. I don't need to know exactly which keys are in the intersection.

Are there any data structures that support this kind of operation in less than O(n) time?
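For reference, the straightforward hash-based baseline, which already runs in O(min(|A|, |B|)) expected time, might look like this sketch:

```python
def intersection_size(a, b):
    """Count keys common to both sets by probing the larger set
    with each key of the smaller one: O(min(|a|, |b|)) expected time."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return sum(1 for key in small if key in large)

print(intersection_size({1, 2, 3, 4}, {3, 4, 5}))  # prints 2
```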

Edit:

Reading the data in does require O(n) time, but that does not imply that the intersection operation itself cannot be done in less than O(n) time.

Imagine this scenario:

I have N sets, each containing 100 keys. I read them in; that's N*100 operations. Now I want to know which pair of sets has the largest intersection; that's O(N²) intersection operations. So I want to reduce the complexity of the intersection operation. I don't really care how long it takes to read and construct the sets; it's at most N*100 operations, which is nothing compared to O(N²) intersection operations.

Be aware, there is no way to find the pair of sets with the largest intersection using fewer than O(N²) intersection operations; I can prove that. You must perform all the intersection operations.

(The basic idea: imagine a complete graph with N vertices, each representing a set, and N×(N-1)/2 edges, each representing the intersection of the connected pair. Assign each edge any non-negative weight you like (representing the intersection size); I can always construct N sets satisfying those N×(N-1)/2 edge weights. That proves my claim.)
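One way to make that construction concrete (a sketch; `construct_sets` is a hypothetical helper name): for every pair (i, j), mint w fresh keys that go into set i and set j only. Since each key belongs to exactly one pair, the pairwise intersection sizes match the chosen weights exactly; sets can afterwards be padded with globally unique filler keys to equalize their sizes.

```python
def construct_sets(n, weights):
    """Given a desired weight for every pair (i, j) with i < j, build n sets
    whose pairwise intersection sizes match those weights exactly."""
    sets = [set() for _ in range(n)]
    next_key = 0
    for (i, j), w in weights.items():
        for _ in range(w):  # w fresh keys, shared only by sets i and j
            sets[i].add(next_key)
            sets[j].add(next_key)
            next_key += 1
    return sets

sets = construct_sets(3, {(0, 1): 3, (0, 2): 1, (1, 2): 0})
print([len(sets[i] & sets[j]) for (i, j) in [(0, 1), (0, 2), (1, 2)]])  # [3, 1, 0]
```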

iouvxz
  • Do the sets have any potentially useful properties? For example, perhaps they can typically be described as the union of a few ranges? – harold Aug 25 '16 at 21:16
  • The only way to achieve a better complexity is to have more information about the data. And even then, I'm not sure you could achieve O(log n). – Nelfeal Aug 25 '16 at 21:29
  • I've written a proof below that such an algorithm doesn't exist. – xenteros Aug 25 '16 at 21:30
  • No matter how you slice it, you have to read all of the items. There is no solution that is better than O(n). – Jim Mischel Aug 25 '16 at 21:45
  • @JimMischel Your reasoning is insufficient. – iouvxz Aug 25 '16 at 22:31
  • Do the keys have constraints? Are they integers from 0 to n, or something similar? Do all sets always contain the same number of keys as in your example? Do you have more sets than keys (I'd assume so)? – Nelfeal Aug 25 '16 at 22:49
  • @Nelxiost The only constraint is that they are 32-bit integers, nothing else. – iouvxz Aug 25 '16 at 22:52

2 Answers


I would advise you to take a look at two possible alternatives, which work particularly well in practice (especially for large sets).

1. The Bloom Filter data structure

A Bloom filter is a space-efficient probabilistic data structure (based on a bit array) that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

There is a trade-off between the false positive rate and the memory footprint of the Bloom filter, so it is possible to estimate an appropriate filter size for different use cases.

Each set can be associated with its own Bloom filter. It is very easy to obtain the Bloom filter that corresponds to the intersection of several sets: the bit arrays of the individual Bloom filters are combined using the bitwise AND operation.

Given the Bloom filter that corresponds to the intersection, it is possible to quickly find the items that are present in all intersected sets.

Apart from that, it is possible to estimate the cardinality of the intersection without actually iterating over the sets: https://en.wikipedia.org/wiki/Bloom_filter#The_union_and_intersection_of_sets
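As a sketch (the parameters M and K and the SHA-256-based hashing are illustrative assumptions, not a tuned implementation), the filters for the question's example can be built, AND-ed, and the intersection cardinality estimated from the number of set bits:

```python
import hashlib
import math

M, K = 256, 4  # bit-array size and hash count: illustrative choices, not tuned

def positions(key):
    """Derive K deterministic bit positions for a key (sketch; real filters
    use cheaper hash schemes such as double hashing)."""
    return [int(hashlib.sha256(f"{seed}:{key}".encode()).hexdigest(), 16) % M
            for seed in range(K)]

def bloom(keys):
    """Build a Bloom filter for a set of keys, stored as an M-bit integer."""
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits

def estimate_cardinality(bits):
    """Classic estimate n* = -(M/K) * ln(1 - X/M), where X is the popcount."""
    x = bin(bits).count("1")
    return -(M / K) * math.log(1 - x / M)

a, b = bloom({1, 2, 3, 4}), bloom({3, 4, 5})
est = estimate_cardinality(a & b)  # close to |{3, 4}| = 2 for these parameters
```

Note that the AND of two filters has every bit of the true intersection's filter set, plus possibly a few extra from coincidental overlaps, so the estimate is biased slightly upward.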

2. The Skip list data structure

A Skip List is a data structure that allows fast search and intersection within an ordered sequence of elements. Fast search and intersection are made possible by maintaining a linked hierarchy of subsequences, each skipping over fewer elements.

Succinctly put, a skip list is very similar to a plain linked list, except that each node has a few additional pointers to items located further ahead (pointers that "skip" over a number of other nodes in the list).

Hence, to compute an intersection, pointers are maintained into all the skip lists being intersected. During the intersection, the pointers jump over the items that are not present in all of the lists. As a result, the runtime of the intersection operation is usually faster than O(n).

The algorithm of the intersection of Skip Lists is described in the book "Introduction to Information Retrieval" (written by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze): http://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html
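The skip-pointer traversal from the IR book can be sketched over plain sorted arrays with a stride of about √n (a simplification; a real skip list keeps these pointers in a linked hierarchy of nodes):

```python
import math

def skip_intersection_size(a, b):
    """Count common elements of two sorted lists of distinct keys, following
    skip pointers of stride ~sqrt(n), as in the IR-book algorithm."""
    sa = max(1, math.isqrt(len(a)))  # skip stride for list a
    sb = max(1, math.isqrt(len(b)))  # skip stride for list b
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            # follow the skip pointer only while it does not overshoot b[j]
            if i + sa < len(a) and a[i + sa] <= b[j]:
                i += sa
            else:
                i += 1
        else:
            if j + sb < len(b) and b[j + sb] <= a[i]:
                j += sb
            else:
                j += 1
    return count

print(skip_intersection_size([1, 2, 3, 4], [3, 4, 5]))  # prints 2
```

When the lists differ a lot in density, many elements are jumped over in one step, which is where the sub-linear behavior comes from in practice.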

Skip Lists are actively used in a high-performance, full-featured text search engine library: Apache Lucene (Skip Lists are used inside the Inverted Index component).

Here is an additional Stackoverflow question about the usage of Skip Lists in Lucene: how lucene use skip list in inverted index?

stemm

Let's assume there is an algorithm that computes the intersection size in less than O(n) time. Then it reads only part of the input. There are two options: it has read the whole of one set and part of the other, or it has read only part of each set.

Option 1):

counter-example: take an input such that some element was read from set 1 and is in set 2, but was never read from set 2; the algorithm can be made to return an incorrect result.

Option 2):

counter-example: take an input such that some element is in both sets but was never read from at least one of them; again the algorithm can be made to return an incorrect result.

So we've proven that no algorithm returns a correct result without reading the whole input.

Let's read the whole input - n numbers. Oops, the complexity is O(n).

End of proof.

xenteros
  • Why should I read it again? I have two sets properly constructed as a tree or a hash map or something. – iouvxz Aug 25 '16 at 21:32
  • I already have the data in memory. I just need a quick test that returns the size of the intersection. – iouvxz Aug 25 '16 at 21:34
  • Well, the structure might be two sets with an additional field for the intersection value. Then it'll be O(1). Simply calculate the intersection at input time. At which moment does taking the input finish? – xenteros Aug 25 '16 at 21:36
  • It's true that, in the general case, you cannot have an algorithm that doesn't read every value in both sets, because one unread value might change the output. It is not true if we have particular constraints on the input. @iouvxz, do you have such constraints? – Nelfeal Aug 25 '16 at 21:36
  • @Nelxiost Think of this scenario: I have N sets, each containing 100 keys. I read them in; that's N*100 operations. Now I want to know which pair of sets has the largest intersection; that's O(N²) intersection operations. So I want to reduce the complexity of the intersection operation. I don't really care how long it takes to read and construct the sets; it's at most N*100 operations, which is nothing compared to O(N²) intersection operations. – iouvxz Aug 25 '16 at 21:45
  • @xenteros If the structure already has an additional field for the intersection value, then you've already done the intersection operation. Just reading the result and returning it is not the intersection operation itself. – iouvxz Aug 25 '16 at 21:51
  • @iouvxz Then modify your question to state the real problem. The answer to your current question is *no*. The answer to "can I find the pair of sets that have the largest intersection in less than quadratic time" is probably *yes*. – Nelfeal Aug 25 '16 at 21:58
  • @Nelxiost No, you are missing the point. There's no way to find the pair of sets that have the largest intersection in less than quadratic time; I can prove that. Now the problem is: if I must do O(N²) intersection operations, each operation's time must be minimal, preferably less than O(M) (M is the size of each set). – iouvxz Aug 25 '16 at 22:02
  • @Nelxiost I can prove that, but I'm not a native English speaker; it would be too hard for me to explain my proof to you. – iouvxz Aug 25 '16 at 22:05
  • @iouvxz Nlexiost is trying to tell you that the whole algorithm for "*I want to know which pair of sets have the largest intersection*" might be different than naively doing N² intersections. – Bergi Aug 25 '16 at 22:05
  • @Bergi It's the same; I can prove that, but I can't express myself, except in Chinese :-). – iouvxz Aug 25 '16 at 22:07
  • @iouvxz: We're not arguing about `O(N²)` being the lower bound, but about the algorithm consisting of N² intersection steps. Using a different approach might lead to an easier understandable algorithm than putting the complicated work in the set structure. – Bergi Aug 25 '16 at 22:10
  • @Bergi That's all I want: find out which pair has the largest intersection. That's what I'm doing. – iouvxz Aug 25 '16 at 22:21
  • @iouvxz: I guess you should [ask a new question](http://stackoverflow.com/questions/ask) about that then. And don't forget to state what kind of data your sets contain. – Bergi Aug 25 '16 at 22:24
  • @Bergi I have to do O(N²) intersection operations, and I have to minimize each operation's time. If each intersection operation costs O(M) time (M is the size of the smallest given set), then the whole algorithm is O(N²×M); if each intersection operation costs O(log(M)) time, then the whole algorithm is O(N²×log(M)). Minimizing the intersection operation's time is the whole point of my program. – iouvxz Aug 25 '16 at 22:40
  • @iouvxz: "*I have to do O(N²) intersection operations*" - No. Start there with an approach that does not need all of them. – Bergi Aug 25 '16 at 22:48
  • @Bergi I've proved that. (The basic idea: imagine a complete graph with N vertices, each representing a set, and N×(N-1)/2 edges, each representing the intersection of the connected pair. Assign each edge any non-negative weight you like (representing the intersection size); I can always construct N sets satisfying those N×(N-1)/2 edge weights. That proves my claim.) – iouvxz Aug 25 '16 at 23:06
  • @iouvxz: Still, to find the maximum edge you do not necessarily have to compute every edge weight *independently*. – Bergi Aug 25 '16 at 23:13
  • @Bergi Why? Finding the pair of sets with the largest intersection can be reduced to finding the maximum edge with arbitrarily given weights in a complete graph, and finding the maximum edge with arbitrarily given weights in a complete graph is equivalent to finding the maximum integer in an array. – iouvxz Aug 25 '16 at 23:20
  • @iouvxz: the edge weights aren't random, and computing them seems to be your actual perf issue. By not doing this computation (intersection) for every of the edges *on its own*, you can speed up the creation of the adjacency matrix. (And with the matrix, finding the maximum is trivial indeed) – Bergi Aug 25 '16 at 23:24
  • @Bergi In the worst case, they are: "assign each edge any non-negative weight you like (representing the intersection size); I can always construct N sets satisfying those N×(N-1)/2 edge weights". The construction is the core idea of my proof, but I'd need to draw many graphs to explain it. – iouvxz Aug 25 '16 at 23:28
  • @iouvxz: That reduction just doesn't work, because you're trivialising work ("compute set intersection") to "give weight all you want". This works for establishing a lower bound, it does not work to come up with a proper algorithm. – Bergi Aug 25 '16 at 23:35
  • @iouvxz: Btw, it seems your actual problem has been solved [here](http://stackoverflow.com/q/2697183/1048572) – Bergi Aug 25 '16 at 23:35
  • @Bergi Too hard to explain it to you. – iouvxz Aug 25 '16 at 23:45
  • @Bergi How can you know the minimum value in an array without knowing all the values? – iouvxz Aug 25 '16 at 23:52
  • @iouvxz: I never said that. I said that you should create the array in a more clever way than blindly running an intersection algorithm for every entry. – Bergi Aug 25 '16 at 23:55
  • @Bergi But the worst case is that no matter how many edges you've computed, the remaining edges can still have any value from zero to infinity, unless you compute all the edges. – iouvxz Aug 25 '16 at 23:58
  • @iouvxz: yes, worst case is definitely `Ω(N²)`, and you do have to compute all edges indeed. You just don't have to compute them by repeated intersection. – Bergi Aug 26 '16 at 00:03
  • OK, suppose I construct my N sets this way. First, I construct N×(N-1)/2 subsets (I call them subsets to distinguish them from the N sets I actually want to construct), each with a pair (i,j), where 0<=i – iouvxz Aug 26 '16 at 00:15
  • Fourth, for each 0<=i – iouvxz Aug 26 '16 at 00:18
  • Now you have N sets, each of weight x, and each intersection has an arbitrary weight. – iouvxz Aug 26 '16 at 00:19