
Say I have a large set of arrays (there can be up to millions of them), and given an input array I want to determine (preferably exactly, although approximately is fine) the array in this set with the largest intersection with the input. What would be the most efficient way to do this? At the bottom I list some solutions that have crossed my mind, which reduce this to another problem, but I am not sure they are necessarily the best.

This set of arrays can be stored in any data structure, and the arrays can be sorted and stored in any way. The idea is to optimize query time here.

Example: say my set of arrays is (sorted in a radix-like manner for convenience; it can be sorted in any way chosen):

[('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'), ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]

and my input array is:

('a', 'f')

Then the respective intersections are:

[('a'), ('a', 'f'), ('f'), (), ('f'), ()]

So the output would be ('a', 'f'), having the largest intersection of size 2. As a bonus, it would be even better to have the largest K of these, so here, if K = 3, the output would be (in any order):

[('a', 'f'), ('f'), ('a')]

Some possible solutions I have thought of:

  • The size of my domain is restricted (e.g. it could be a-z, or the numbers 1-70, etc.), so I could potentially represent these as binary strings, and the challenge then becomes finding the minimum Hamming distance, which I could do with something like locality-sensitive hashing (see the sketch right after this list). For example, ('a', 'f') could be represented as 10000100000000000000000000
  • Also using the fact that the domain is restricted, I could create an inverted index with the items in the domain pointing to the different arrays in the set, and then intersect (at least some of) these result lists for each item in the input array - although I feel this would be incredibly inefficient (especially if the intersection turns out to be small). This is similar to how a Google search works, although I don't know the full details of their algorithm.
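A minimal sketch of the first idea in plain Python (my own illustration - the names `encode` and `best_k` are made up, and the domain is assumed to be the lowercase letters a-z). The exact intersection size is the popcount of the bitwise AND of two masks, whereas the Hamming distance is the popcount of the XOR; the two rankings agree when all candidate arrays have the same size, but not necessarily otherwise.

import heapq

# Assumed domain: lowercase letters a-z, one bit per letter.
DOMAIN = {c: i for i, c in enumerate('abcdefghijklmnopqrstuvwxyz')}

def encode(arr):
    # ('a', 'f') -> bits 0 ('a') and 5 ('f') set.
    mask = 0
    for x in arr:
        mask |= 1 << DOMAIN[x]
    return mask

def best_k(query, arrays, k=1):
    # Exact linear scan: |query ∩ candidate| == popcount(q & encode(candidate)).
    # In practice the candidate masks would be precomputed once, not per query.
    q = encode(query)
    return heapq.nlargest(k, arrays, key=lambda a: bin(q & encode(a)).count('1'))

arrays = [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
          ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
print(best_k(('a', 'f'), arrays, k=3))  # [('a', 'e', 'f'), ('a', 'b'), ('b', 'f', 'g')]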

Thank you for any responses or pointers in the right direction!

NightShade
2 Answers


Some questions beforehand that I couldn't ask via comment due to lacking reputation:

  1. All arrays are unique, but is every array itself a set (i.e. free of duplicate elements)?
  2. If more than one array shares the largest intersection size, do you need to list them all?
  3. Might your input be longer than the longest given array?

Iteration

Without a hashset I would sort the arrays by length and start with the longest ones, so that towards the end you can skip the shorter arrays entirely once you have found an intersection whose size is greater than or equal to their lengths (a sketch of this appears at the end of this section).

If you also sort the arrays themselves, you could make use of the Hamming distance, but you don't have to sort and convert all arrays at once - you can start with only a share of them. If you don't use the Hamming distance, keep in mind that when you compare your input against a candidate array you only have to compare until you hit the first element of the candidate that is larger than your input's last element:

a f

a c k z // since k > f we can stop here: f cannot occur after k, so z never needs to be looked at

I think this approach boils down to a complexity of O(n lg n): sorting the arrays by size is O(n lg n), calculating the sizes is n * O(1), and the inner radix sort is O(n). The comparison itself would be O(n lg n) (not too sure about this one), so the total is O(n lg n) * 2 + 2 * O(n) => O(n lg n).
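Here is a rough Python rendering of this iteration (my own sketch of the idea, assuming every array is stored as a sorted tuple): candidates are visited longest-first, the merge-style walk stops as soon as the candidate's current element exceeds the query's last element, and the outer loop exits once the remaining candidates are too short to beat the best size found so far.

def sorted_intersection_size(query, candidate):
    # Merge-style walk over two sorted tuples; stop once candidate[j] > query[-1].
    i = j = count = 0
    while i < len(query) and j < len(candidate) and candidate[j] <= query[-1]:
        if query[i] == candidate[j]:
            count += 1
            i += 1
            j += 1
        elif query[i] < candidate[j]:
            i += 1
        else:
            j += 1
    return count

def best_match(query, arrays):
    # Visit candidates longest-first so we can stop when the rest are too short.
    best_size, best = -1, None
    for cand in sorted(arrays, key=len, reverse=True):
        if len(cand) <= best_size:   # no remaining array can beat the current best
            break
        size = sorted_intersection_size(query, cand)
        if size > best_size:
            best_size, best = size, cand
    return best

raw = [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
       ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]
arrays = [tuple(sorted(a)) for a in raw]   # the sketch assumes sorted tuples
print(best_match(('a', 'f'), arrays))      # ('a', 'e', 'f')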

Tree

Just a rough idea: you could sort all arrays with radix sort, transform them into Hamming (bit) strings, fill a tree with them, and traverse it until no further traversal could lead to a smaller distance. How efficient this is I have no idea.

https://stackoverflow.com/a/6390606/9758920
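One way this rough idea could be realized (this is my reading of it, not necessarily what the linked answer describes) is a BK-tree keyed on the Hamming distance between the bitmask encodings from the question. A sketch, with names (`BKNode`, `bk_insert`, `bk_nearest`) that are mine; note that minimizing Hamming distance is only a proxy for maximizing intersection size when the arrays differ in length.

class BKNode:
    # Node of a BK-tree over integer bitmasks, with Hamming distance as the metric.
    def __init__(self, mask, payload):
        self.mask = mask
        self.payload = payload
        self.children = {}            # edge distance -> child node

def hamming(a, b):
    return bin(a ^ b).count('1')

def bk_insert(root, mask, payload):
    node = root
    while True:
        d = hamming(mask, node.mask)
        if d in node.children:
            node = node.children[d]
        else:
            node.children[d] = BKNode(mask, payload)
            return

def bk_nearest(root, query_mask):
    # Every item in the subtree under edge k is at distance exactly k from its
    # parent, so by the triangle inequality its distance to the query is at
    # least |d - k|; subtrees that cannot beat the best so far are pruned.
    best = (hamming(query_mask, root.mask), root.payload)
    stack = [root]
    while stack:
        node = stack.pop()
        d = hamming(query_mask, node.mask)
        if d < best[0]:
            best = (d, node.payload)
        for k, child in node.children.items():
            if abs(d - k) < best[0]:
                stack.append(child)
    return best                       # (distance, payload)

# Usage, with the encode() helper from the sketch under the question:
# root = BKNode(encode(arrays[0]), arrays[0])
# for a in arrays[1:]:
#     bk_insert(root, encode(a), a)
# print(bk_nearest(root, encode(('a', 'f'))))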

claasic
  • 1. Each array can be stored as a set, yes (or anything really - the structure of the data is still up to me to decide). 2. I only need to list one of these (or at most K, where K is the number of best matches for the user). 3. The input may certainly be shorter, longer or the same size as any of the arrays contained within the set. These are certainly some amazing optimizations (and even the small pre-processing of the data only needs to be done once) - actually the best way I think it can be done so far. I would say that given my data set it would on average still have to iterate through fully, though. – NightShade Jun 04 '19 at 09:21
  • Ran out of characters in the old comment: would this be some sort of M-tree? How would this tree implementation work? Any divide & conquer-like solution would be amazing. I don't fully understand the implementation, but given an algorithm I can do the research. – NightShade Jun 04 '19 at 09:34

I would suggest a straightforward approach using hashsets.
If the hashsets are well implemented, with a good hash function, then we can consider that checking whether an element belongs to one of them takes O(1).
Then, we can do the following:

def find_closest_arrays(A, B_sets):
    # A: the input array; B_sets: the candidate arrays B_1 ... B_n stored as hashsets.
    result = [0] * len(B_sets)         # result[i] = |A ∩ B_i|
    for elem in A:
        for i, B in enumerate(B_sets):
            if elem in B:              # O(1) membership test on a hashset
                result[i] += 1
    return result

This function returns an array result, where result[i] contains the number of elements in common between the input array A and B_i.
From here, getting the k best is immediate: all you have to do is take the indices of the k largest values in result.
The time complexity of this algorithm is O(n * m), with m the size of the input array and n the number of arrays in the set.
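As a concrete illustration of that top-k step (a small usage sketch; the data is just the example from the question and find_closest_arrays is the Python version above):

import heapq

A = ('a', 'f')
B_sets = [set(b) for b in [('a', 'b'), ('a', 'e', 'f'), ('b', 'f', 'g'),
                           ('b', 'j', 'z'), ('d', 'l', 'f'), ('x', 'y', 'z')]]
result = find_closest_arrays(A, B_sets)
top_k = heapq.nlargest(3, range(len(result)), key=result.__getitem__)
print(top_k)   # indices of the 3 largest counts: [1, 0, 2]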

m.raynal
  • I felt this was quite "brute force-ish", and while it does work, I came asking to see if there would be a better approach. This is essentially a linear scan, and over a few hundred thousand to a million entries it can be quite expensive. Edit: essentially, the closer to some divide & conquer-like approach the better. One thing I considered, for example, is that if the input array is as above we wouldn't need to consider `('x','y','z')` or anything beyond it. This was as far as I could get with this line of thinking. – NightShade Jun 04 '19 at 09:08
  • I'm not sure that you can have better performance than that in the general case, because if we consider that all the arrays have more or less the same size `m` (not only the input array), then `O(n * m)` is the size of the input of the program, and thus a faster algorithm can't exist (we must at least read all the program's input). But it is true that in some specific cases, you might implement some tricks. – m.raynal Jun 04 '19 at 09:43
  • If you want it to run faster, you can parallelize the code trivially. – m.raynal Jun 04 '19 at 09:45
  • The solution above seems to work well making use of the Hamming distance, but this could also be because I know something about the domain of the data. The other thing here is that n is already in storage, and generally sorting something tells us something about the structure of the data, letting us "prune" some of it off - meaning we don't even need to consider it. I wasn't sure of an implementation for this, hence why I came here. – NightShade Jun 04 '19 at 09:46