How to find smallest number of lists needed to cover all elements in another list

Question

I'm working on a code using Matlab in which I need to find the least number lists (in some set of given lists) necessary to cover all the elements of a reference list.

For example, say my reference list is

X = [0 1 2 3 4 5 6 7 8 9]

And I have a given set of lists as follows:

A = [0 1 3 5 6 7 9]
B = [0 1 2 3 4]
C = [5 6 7 8 9]
D = [1 2 3 4]
E = [1 5 7 8]

The smallest number of lists needed to cover every element in X is 2 (B and C), however, if I initially only search for the list that covers the most elements (A) and then try to find other lists that will cover the remaining elements, I'll end up using at least 3 lists. What would be the best way to write a code that can search for the smallest number of lists necessary for this (it would give me an output of B and C)? Any help at all would be greatly appreciated...even just a conceptual explanation (not actual code) of how to best approach this problem would be a huge help!

First suggestion, don't use `A`, `B`, `C`, etc... but use a cell array i.e. `lists={[0, 1, 3, 5, 6, 7, 9]; [0, 1, 2, 3, 4]; [5, 6, 7, 8, 9],...}` as that way you can easily loop through combination by indexing `lists`. Next suggestion, post your attempt at solving it, even if it's just brute force or even if it's not correct just to show you've really given it a shot. I don't think this is an easy problem to solve btw but at least demonstrate that you've taken a crack at it. — Dan, Mar 31 '16 at 06:54
Some functions you might find useful: [`union`](http://www.mathworks.com/help/matlab/ref/union.html), [`isequal`](http://www.mathworks.com/help/matlab/ref/isequal.html) or [`setdiff`](http://www.mathworks.com/help/matlab/ref/setdiff.html), [`nchoosek`](http://www.mathworks.com/help/matlab/ref/nchoosek.html) — Dan, Mar 31 '16 at 07:27
@Dan regarding `union` - initially thought so too, but it's only good for two inputs at a time (where in this scenario the user can have up to 5, so it complicates things). I would rather go for simple concatenation and `unique`. @matlab137 Also see [`perms`](http://www.mathworks.com/help/matlab/ref/perms.html). — Dev-iL, Mar 31 '16 at 07:34
@Dev-iL Theoretical you can use the form `union(A,union(B,C))`, but here you have too many options and you don't know how many `unions` you need... — Adiel, Mar 31 '16 at 10:34
As @Amro points out, this is the Set Cover problem, which is an NP-hard problem. That means in effect that *every* algorithm that can solve this problem will take exponential time (in the size of the problem) for at least some inputs -- or IOW, don't worry if your algorithm runs slowly. — j_random_hacker, Mar 31 '16 at 16:24

score 4 · Answer 1 · edited Jun 20 '20 at 09:12

Approach #1: Iterative "brute-force" of all possible combinations

Below is one possible algorithm that illustrates how to solve the problem. The code itself should be self-explanatory, but the idea is that we test all possible combinations of lists until a valid one is found (hence we don't encounter the problem you described where we mistakenly choose lists based on their length).

function varargout = q36323802
R = [0 1 2 3 4 5 6 7 8 9]; %// Reference List

L = {... // As per Dan's suggestion:
 [0 1 3 5 6 7 9]
 [0 1 2 3 4]
 [5 6 7 8 9]
 [1 2 3 4]
 [1 5 7 8]
 };
out = []; %// Initialize output
%% // Brute-force approach:
nLists = numel(L);
for indN = 1:nLists
  setCombinationsToCheck = nchoosek(1:nLists,indN);  
  for indC = 1:size(setCombinationsToCheck,1)
    u = unique(cat(2,L{setCombinationsToCheck(indC,:)}));
    if all(ismember(R,u))
      out = setCombinationsToCheck(indC,:);
      disp(['The minimum number of required sets is ' num2str(indN) ...
      ', and their indices are: ' num2str(out)]);
      return;
    end
  end
end
disp('No amount of lists found to cover the reference.');

if nargout > 0 
  varargout{1} = out;
end

For your example the output is:

The minimum number of required sets is 2, and their indices are: 2 3

Note(s):

This method does some redundant computations by not using lists of length n-1 in iteration n, which were already found in previous iterations (when applicable). A recursive solution may work in this case.
There is probably a way to vectorize this, which I did not really think about in depth.
I assumed all inputs are row vectors. There would have to be some extra steps if this is not the case.

Thanks go to Adiel for suggesting some improvements, and for Amro for finding some bugs!

Approach #2: Tree search ^{_Experimental}

I've attempted to also build a recursive solver. Now it finds a solution, but it's not general enough (actually the problem is that it only returns the first result, not necessarily the best result). The reasoning behind this approach is that we can treat your question as a tree search problem, and so we can employ search/pathfinding algorithms (see BFS, DFS, IDS etc.). I think the algorithm below is closest to DFS. As before, this should mainly illustrate an approach to solving your problem.

function q36323802_DFS(R,L)
%% //Input checking:
if nargin < 2 || isempty(L)
  L = {... // As per Dan's suggestion:
   [0 1 3 5 6 7 9]
   [0 1 2 3 4]
   [5 6 7 8 9]
   [1 2 3 4]
   [1 5 7 8]
   };
end
if nargin < 1 || isempty(R)
  R = [0 1 2 3 4 5 6 7 8 9]; %// Reference List
end
%% // Algorithm (DFS: breadth-first search):
out = DFS_search(R,L,0);
if isempty(out)
  disp('No amount of lists found to cover the reference.');
else
  disp(['The minimum number of required sets is ' num2str(numel(out)) ...
    ', and their indices are: ' num2str(out)]);
end
end

function out = DFS_search(R,L,depth)
  %// Check to see if we should stop:
  if isempty(R) || isempty(L)
    % // Backtrack here?
    out = [];
    return;
  end
  if isnan(R)
    out = [];
    return;
  end

  nLists = numel(L);
  reducedR = cellfun(@(R,L)setdiff(R,L),repmat({R},[nLists,1]),L,'UniformOutput',false)';
 %'//  We consider a case where the reduction had no effect as "hopeless" and
  %// "drop" it.
  isFullCoverage = cellfun(@isempty,reducedR);
  isHopeless = cellfun(@(R)all(isnan(R)),reducedR) | cellfun(@(rR)isequal(rR,R),reducedR);
  reducedR(isHopeless) = deal({NaN});
  if all(isHopeless) && ~any(isFullCoverage)
    out = [];
    return
  end
  if any(isFullCoverage) %// Check current "breadth level"
    out = find(isFullCoverage,1,'first');
    return
  else
    for indB = 1:nLists
      out = DFS_search(reducedR{indB},L,depth+1);
      if ~isempty(out)
        out = [indB out]; %#ok
        %// TODO: test if one of the sets is covered by the others and remove it
        %// from the list "out".
        %// Also, keep track of the best path and only return (finally) if shortest
        return
      end
    end
  end  
end

I think that it better use `ismember` then `isequal`, to include the option that the combined lists include the reference list and additional numbers... — Adiel, Mar 31 '16 at 10:38
And, for the first note, why not use `nchoosek` instead of `prems`+`unique`+`sort`? You can do it by `setCombinationsToCheck=nchoosek([1:nLists],indN);` — Adiel, Mar 31 '16 at 11:07
@עדיאל - Thanks you for you comments. I'll update my answer. Regarding `nchoosek` - I'm simply unfamiliar with its use since I've been using the [`VChooseK` family of functions by Jan Simon](http://www.mathworks.com/matlabcentral/fileexchange/?utf8=%E2%9C%93&sort=downloads_desc&term=authorid%3A15233+Choose) instead, which supposedly works better. — Dev-iL, Mar 31 '16 at 11:18
And I didn't know about Jan's functions... seems interesting, nice to know. תודה :) — Adiel, Mar 31 '16 at 11:22

score 1 · Answer 2 · edited Jan 18 '21 at 12:35

A similar solution to Dev-iL's 1^st approach, by Amro:

function varargout = q36323802A
R = [0 1 2 3 4 5 6 7 8 9];

names = {'A' 'B' 'C' 'D' 'E'};
L = {...
    [0 1 3 5 6 7 9]
    [0 1 2 3 4]
    [5 6 7 8 9]
    [1 2 3 4]
    [1 5 7 8]
};
N = numel(L);

%// powerset of L: set of all subsets (excluding empty set)
powerset = cell(1,N);
for k=1:N
    sets = nchoosek(1:N, k);
    powerset{k} = num2cell(sets,2);
end
powerset = cat(1, powerset{:});

%// for each possible subset, check if it covers the target R
mask = false(size(powerset));
for i=1:numel(powerset)
    elems = unique([L{powerset{i}}]);
    mask(i) = all(ismember(R, elems));
end
if ~any(mask), error('cant cover target'); end

%// from candidates, choose the one with least amount of sets
candidates = powerset(mask);
len = cellfun(@numel, candidates);
[~,idx] = min(len);
out = candidates{idx};
varargout{1} = names(out);

How to find smallest number of lists needed to cover all elements in another list

2 Answers2

Approach #1: Iterative "brute-force" of all possible combinations

Note(s):

Approach #2: Tree search Experimental

Approach #2: Tree search ^{_Experimental}