Algorithm for finding the biggest subset (kernel) of a set of arrays

Question

Let S be a set of arrays, we want to find the largest subset of values that appear in each of the arrays in S. values may appear more than once, if a value appears atleast x times in each of the arrays, it should appear x times in the output set.

This problem can be seen as finding the kernel of S.

For the sake of complexity computations, assume S contains n arrays, all of them are of size m.

Example:

A = {4, 3, 8, 12, 5, 4, 4, 4}
B = {5, 4, 12, 1, 1, 4, 1, 1}
C = {2, 11, 4, 8, 4, 5, 2, 8}

Answer = {4, 4, 5} // order doesn't matter

A couple of remarks:

If the values would have been unique in each array, we could have used a hashtable to store all the values, then count the number of appearances of each of them, if it number of appearances was equal to the number of arrays in S, it would enter the answer.
If we would want the unique values set (i.e. in the example the answer would have been {4, 5}), we could have used a hashtable for each array to check if a value was already added to the "big" hashtable (from the previous bullet) to prevent double insertion of a value that appeared more than once in an array (or any other prevention of duplicate insertions).

Any suggestions of an efficient (time and memory) algorithm to perform this task?

Given my answer here : http://stackoverflow.com/a/19814197/905902 , your problem reduces to finding the GCD. — wildplasser, Nov 06 '13 at 14:32
Consider asking cs-related questions on [cs.stackexchange.com](http://cs.stackexchange.com). — Realz Slaw, Nov 10 '13 at 19:17

Bernhard Barker · Accepted Answer · 2013-11-06T16:32:24.120

Hashing-based approach

Create a hash table of values to counts of the first array.

For each entry in the hash table, also store a total, which should be set to be the same as the count.

For each next array:

Reset the count of each element in the hash table.
Go through the array, setting up the counts in the hash table.
For each entry in the hash table, set total = min(count, total)
(remove entries with total = 0)

Complexity:

O(n) space and (expected) time complexity where n is the total number of elements.

Sorting-based approach

Heapify each of the arrays (to a min-heap) (can also work with sorting, but that explanation was a bit confusing).

Let max = the largest of the minimums of each of the heaps and maxIndex = the index of that heap.
Loop over the heaps starting from maxIndex + 1 with wrap-around until we run out of values.
- If we got all the way back to maxIndex, this means all the values were the same, so output the last popped value.
- If the current heap's minimum is > max, set max and maxIndex appropriately
  (but don't delete the minimum).
- Otherwise delete the minimum from the current heap.

Or (slightly less efficiently):

Sort all the arrays.

Keep a binary search tree (BST) of all current elements (initialized to contain the first element of each array).
Repeatedly remove the minimum from the BST and add the next element from whichever array the removed element was from.
Whenever the minimum and the maximum of the BST are the same, all the nodes are equal, thus we output that value.
To deal with the duplicates, we need to keep a count of how many elements was added since the last time we output a value. Only output a value when that count is greater than or equal to the number of arrays.

Complexity:

O(m) space and O(m n log n) time complexity where m is the number of arrays and n is the average number of elements per array.

The hashing approach is good, but the sorting-based approach can be improved: you don't need a BST. Just sort (or heapify) each array, and observe that the largest minimum across all these heaps is the lowest value that could possibly be shared by all of them. So just keep track of largestMin, plus the number of heaps nWithMaxMin whose minimum is equal to it. When nWithMaxMin == n, you have a shared item, and the multiplicity will be the minimum number of times you need to `deleteMin()` any of the heaps before it yields a different value. — j_random_hacker, Nov 06 '13 at 16:08

score 0 · Answer 2 · edited May 23 '17 at 12:21

encode the sets as products of primes
calculate the common subsets as the GCD of the resulting numbers
factorise the resulting GCD to reveal the common items.
this is of course limited by the number of items, and the sizes of the sets (otherwise the products of primes could overflow)

#include <stdio.h>
#include <string.h>

unsigned primes [] = { 2,3,5,7,11,13,17, 19, 23, 29,31,37,41,43,47};
unsigned value(unsigned array[], unsigned count);
unsigned gcd(unsigned a, unsigned b);
unsigned factorise(unsigned result[], unsigned val);


unsigned value(unsigned array[], unsigned count)
{
unsigned val, idx;

val = 1;
for (idx = 0; idx < count; idx++) {
        val *= primes [ array[idx]];
        }

return val;
}

unsigned gcd(unsigned a, unsigned b)
{
if (b > a) return gcd (b, a);
if (b <= 1) return a;
if (b == a) return a;
return gcd (b, a-b);
}

unsigned factorise(unsigned result[], unsigned val)
{
unsigned idx, count;

for (count=idx=0; val > 1; ) { 
        if ( val % primes[idx] ) idx++;
        else {
                result[count++] = idx;
                val /= primes[idx];
                }
        }
return count;
}

int main(void)
{
unsigned one[]  = {4, 3, 8, 12, 5, 4, 4, 4};
unsigned two[]  = {5, 4, 12, 1, 1, 4, 1, 1};
unsigned three[]  = {2, 11, 4, 8, 4, 5, 2, 8};
unsigned result[12]  , count, idx;;

unsigned val1, val2,val3
        , gcd12 , gcd23 , gcd13, gcd123;

val1 = value(one, 8);
val2 = value(two, 8);
val3 = value(three, 8);

gcd12 = gcd(val1,val2);
gcd23 = gcd(val2,val3);
gcd13 = gcd(val1,val3);
gcd123 = gcd(gcd12,gcd23);

fprintf(stdout, "Val1=%u, Val2=%u Val3=%u gcd= [%u %u %u]: %u\n"
        , val1, val2, val3
        , gcd12 , gcd23 , gcd13, gcd123);

count = factorise( result, gcd123);

for (idx=0; idx < count; idx++) {
        fprintf(stdout, "%c %u", idx? ',' : '{', result[idx] );
        }
fprintf(stdout, "}\n" );

return 0;
}

This is based on my my answer here.

Algorithm for finding the biggest subset (kernel) of a set of arrays

2 Answers2

Hashing-based approach

Sorting-based approach