Say I have a large set of arrays (it can be up to millions in size), and given an input array I want to determine the array in the set whose intersection with the input is largest (preferably exactly, although approximately is fine). What would be the most efficient way to do this? At the bottom I list some solutions that have crossed my mind, which reduce this to other problems, but I am not sure they are necessarily the best.
This set of arrays can be stored in any data structure, and the arrays can be sorted and stored in any way; the goal is to optimize query time.
Example: say my set of arrays is (sorted in a radix-like manner for convenience; it can be sorted in any way chosen):
[('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'), ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
and my input array is:
('a', 'f')
Then the respective intersections are:
[('a'), ('a', 'f'), ('f'), (), ('f'), ()]
So the output would be ('a', 'f') (the intersection with ('a', 'e', 'f')), which has the largest size, 2. As a bonus, it would be even better to get the largest K of these, so here, if K = 3, the output would be (in any order):
[('a', 'f'), ('f'), ('a')]
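For concreteness, this is the brute-force baseline I am trying to beat (a sketch in plain Python; `find_top_k` is just an illustrative name). It returns the stored arrays together with their scores; returning the intersections instead, as in the example above, would be a one-line change:

```python
import heapq

def find_top_k(arrays, query, k=1):
    """Brute force: score every stored array by the size of its intersection with the query."""
    query_set = set(query)
    scored = ((len(query_set.intersection(arr)), arr) for arr in arrays)
    # Keep the k highest-scoring (score, array) pairs.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

arrays = [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
          ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
print(find_top_k(arrays, ('a', 'f'), k=3))
# (2, ('a', 'e', 'f')) first, then two of the size-1 matches
```

This scans the whole set on every query, which is what I am hoping to avoid.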
Some possible solutions I have thought of:
- The size of my domain is restricted (it could be a-z, the numbers 1-70, etc.), so potentially I can represent each array as a binary string, and the challenge then becomes finding the minimum Hamming distance, which I could do approximately with something like locality-sensitive hashing? For example, ('a', 'f') could be represented as 10000100000000000000000000. (A sketch of this bitset idea is below, after the list.)
- Also using the fact that the domain is restricted, I can create an inverted index with the items in the domain pointing to the different arrays in the set, and then intersect (at least some of) these posting lists for each item in the input array, similar to how a Google search works, although I don't know the full details of their algorithm. I feel like this could be incredibly inefficient, especially if the intersection turns out to be small. (A sketch of this idea is also below.)
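To make the first idea concrete, here is a minimal sketch (plain Python, assuming the domain really is just a-z; `to_mask` and `best_by_intersection` are illustrative names). One subtlety: the Hamming distance is popcount(A XOR B), which also penalizes elements the candidate has but the query lacks, whereas the quantity that exactly matches "largest intersection" is popcount(A AND B), so that is what the sketch maximizes:

```python
from string import ascii_lowercase

# Bit i represents the i-th letter: 'a' -> bit 0, ..., 'z' -> bit 25.
BIT = {ch: 1 << i for i, ch in enumerate(ascii_lowercase)}

def to_mask(arr):
    """Encode an array over the a-z domain as a 26-bit integer."""
    mask = 0
    for ch in arr:
        mask |= BIT[ch]
    return mask

def best_by_intersection(masks, query):
    """Exact top-1: maximise popcount(candidate & query), i.e. the intersection size."""
    q = to_mask(query)
    # bin(x).count('1') is the portable popcount; int.bit_count() is faster on Python 3.10+.
    return max(range(len(masks)), key=lambda i: bin(masks[i] & q).count('1'))

arrays = [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
          ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
masks = [to_mask(a) for a in arrays]
print(arrays[best_by_intersection(masks, ('a', 'f'))])   # ('a', 'e', 'f')
```

This still scans every array, but each comparison is a couple of machine-word operations and the masks pack densely in memory; LSH would then be a way to avoid the full scan at the cost of exactness.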
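And a sketch of the second idea. Instead of literally intersecting the posting lists, it seems enough to count how many of the query's posting lists each array appears in, since that count is exactly the intersection size (again plain Python; `build_index` and `top_k` are illustrative names):

```python
from collections import Counter, defaultdict

def build_index(arrays):
    """Inverted index: each domain item -> indices of the arrays containing it."""
    index = defaultdict(list)
    for i, arr in enumerate(arrays):
        for item in set(arr):          # de-duplicate within an array
            index[item].append(i)
    return index

def top_k(index, arrays, query, k=1):
    """Count, per array, how many of the query's posting lists it appears in;
    that count is the intersection size, so the k highest counts win."""
    hits = Counter()
    for item in set(query):
        hits.update(index.get(item, ()))
    return [(count, arrays[i]) for i, count in hits.most_common(k)]

arrays = [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
          ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
index = build_index(arrays)
print(top_k(index, arrays, ('a', 'f'), k=3))
# e.g. [(2, ('a', 'e', 'f')), (1, ('a', 'b')), (1, ('b', 'f', 'g'))]
```

The work per query is proportional to the total length of the posting lists touched, so it is cheap when the query's items are rare but degrades when they are very common.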
Thank you for any responses or pointers in the right direction!