2

This question is can be viewed continuation/extension/generalization of a previous question of mine from here.

Some definitions: I have a set of integers S = {1,2,...,s}, say s = 20, and two matrices N and M whose rows are finite sequences of numbers from S (i.e. permutations with possible repetitions), of order n and m respectively, where 1 <= n <= m. Let us think of N as a collection of candidate sub-sequences for the sequences from M.

Example: [2 3 4 3] is a sub-sequence of [1 2 2 3 5 4 1 3] that occurs with multiplicity 2 (=in how many different ways one can find the sub-seq. in the main seq.), whereas [3 2 2 3] is not a sub-sequence of it. In particular, a valid sub-sequence by definition must preserve the order of the indices.

Problem statement:

(P1) For each row of M, obtain the number of sub-sequences of it, with multiplicity and without multiplicity, that occur in N as rows (it can be zero if none are contained in N);

(P2) For each row of N, find out how many times, with multiplicity and without multiplicity, it is contained in M as a sub-sequence (again, this number can be zero);

Example: Let N = [1 2 2; 2 3 4] and M = [1 1 2 2 3; 1 2 2 3 4; 1 2 3 5 6]. Then (P1) returns [2; 3; 0] for 'with multiplicities' and [1; 2; 0] for 'without multiplicities'. (P2) returns [3; 2] for 'with multiplicities' and [2; 1] without multiplicities.

Order of magnitude: M could typically have up to 30-40 columns and a few thousand rows, although I currently have M with only a few hundred rows and ~10 columns. N could be approaching the size of M or could be also much smaller.

What I have so far: Not much, to be honest. I believe I might be able to slightly modify my not-very-well-vectorized solution from my previous question to tackle permutations with repetitions, but I am still thinking on that and will update as soon as I have something working. But given my (lack of) experience so far, it would be in all likelihood very suboptimal :(

Thanks!

Community
  • 1
  • 1
M.G.
  • 293
  • 1
  • 13
  • What about GPU? Can we use gpuArrays here? – Divakar Sep 10 '14 at 08:27
  • Yes, most happily whenever possible :) – M.G. Sep 10 '14 at 08:43
  • In the (P1) example, why is the second result 1? Second row of M contains subsequences 1 2 2 and 2 3 4, so the result for that row should be 2? – Luis Mendo Sep 10 '14 at 10:03
  • Ah, thanks, sorry, it was a typo. I hope it is fixed now. – M.G. Sep 10 '14 at 10:09
  • 1
    @LuisMendo Sorry, shouln'd it be 3 (with multiplicities)? The subsequence 2 3 4 appears twice in 2nd row of M, right? – Luis Mendo Sep 10 '14 at 10:11
  • Indeed, and for (P2) as well :) Can't get my own examples right... – M.G. Sep 10 '14 at 10:15
  • It's easy to make mistakes doing it by hand :-) I just wanted to make sure I was getting it correctly – Luis Mendo Sep 10 '14 at 10:15
  • Sorry for the delay, boys, I had not forgotten about the question :) I wish I could accept both answers. Even though I use `VChooseK` for much better performance than `nchoosek` and `uint8/16` for better memory management rather than the default `double`, given the order of magnitude specified above, something like `nchoosek(1:40,20)` quickly becomes impractical. So, in view of this I decided to select Divakar's answer. Having said that, it will take me some time to follow through the code of both answers – M.G. Sep 16 '14 at 10:22
  • Was a tough question, but still very interesting! Going through your trend, I am already a bit scared of your next question! ;) But looking forward to it! – Divakar Sep 16 '14 at 10:28

2 Answers2

1

This approach uses only one loop over the rows of M (P1) or N (P2). The code makes use of linear indexing and the very powerful bsxfun function. Note that if the number of columns is large you may experience problems because of nchoosek.

[mr mc] = size(M);
[nr nc] = size(N);

%// P1
combs = nchoosek(1:mc, nc)-1;
P1mu = NaN(mr,1);
P1nm = NaN(mr,1);
for r = 1:mr
    aux = M(r+mr*combs);
    P1mu(r) = sum(ismember(aux, N, 'rows'));
    P1nm(r) = sum(ismember(unique(aux, 'rows'), N, 'rows'));
end

%// P2. Multiplicity defined to span across different rows
rr = reshape(repmat(1:mr, size(combs,1), 1),[],1);
P2mu = NaN(nr,1);
P2nm = NaN(nr,1);
for r = 1:nr
    aux = M(bsxfun(@plus, rr, mr*repmat(combs, mr, 1)));
    P2mu(r) = sum(all(bsxfun(@eq, N(r,:), aux), 2));
    P2nm(r) = sum(all(bsxfun(@eq, N(r,:), unique(aux, 'rows')), 2));
end

%// P2. Multiplicity defined restricted to within one row
rr = reshape(repmat(1:mr, size(combs,1), 1),[],1);
P2mur = NaN(nr,1);
P2nmr = NaN(nr,1);
for r = 1:nr
    aux = M(bsxfun(@plus, rr, mr*repmat(combs, mr, 1)));
    P2mur(r) = sum(all(bsxfun(@eq, N(r,:), aux), 2));
    aux2 = unique([aux rr], 'rows'); %// concat rr to differentiate rows...
    aux2 = aux2(:,1:end-1); %// ...and now remove it
    P2nmr(r) = sum(all(bsxfun(@eq, N(r,:), aux2), 2));
end

Results for your example data:

P1mu =
     2
     3
     0
P1nm =
     1
     2
     0
P2mu =
     3
     2
P2nm =
     1
     1
P2mur =
     3
     2
P2nmr =
     2
     1

Some optimizations to the code would be possible. Not sure they are worth the effort:

  • Replace repmat by another bsxfun (using a 3rd dimension). That may save some memory
  • Transpose original matrices and work down colunmns, instead of along rows. That may be faster.
Luis Mendo
  • 110,752
  • 13
  • 76
  • 147
  • Thanks! One question though: shouldn't `P2nm` be `[2; 1]` (since `[1 2 2]` occurs in two different rows) ? – M.G. Sep 10 '14 at 10:49
  • @July I've corrected it. I keep the two approaches to P2: with multiplicity defined to span across different rows, or restricted to within one row – Luis Mendo Sep 10 '14 at 11:02
  • If I understand you correctly, when you allow multiplicity to span across different rows and don't count multiplicities, then a sub-sequence can occur at most 1 time (1 for "it's present in M somewhere" and 0 for "it's not present in M at all") ? – M.G. Sep 10 '14 at 11:26
  • @July Yes, in that case the only possible results are 0 or 1 – Luis Mendo Sep 10 '14 at 11:36
1

Introduction : Owing to the repetitions in the input data in each row, the combination finding process doesn't have the sort of "uniqueness" among elements which was exploited in your previous problem and hence the loops used here. Also, note that the without multiplicity codes don't use nchoosek and as such, I feel more optimistic about them for performance.

Notations :

p1wim -> P1 with multiplicity
p2wim -> P2 with multiplicity
p1wom -> P1 without multiplicity
p2wom -> P2 without multiplicity

Codes :

I. Code for P1, 2 with multiplicity

permN = permute(N,[3 2 1]);

p1wim(size(M,1),1)=0;
p2wim(size(N,1),1)=0;

for k1 = 1:size(M,1)
    d1 = nchoosek(M(k1,:),3);
    t1 = all(bsxfun(@eq,d1,permN),2);
    p1wim(k1) = sum(t1(:));
    p2wim = p2wim + squeeze(sum(t1,1));
end

II. Code for P1, 2 without multiplicity

eqmat = bsxfun(@eq,M,permute(N,[3 4 2 1])); %// equality matrix

[m,n,p,q] = size(eqmat); %// get sizes
inds = zeros(size(M,1),p,q); %// pre-allocate for indices array

vec1 = [1:m]'; %//' setup constants to loop
vec2 = [0:q-1]*m*n*p;
vec3 = permute([0:p-1]*m*n,[1 3 2]);

for iter = 1:p
    [~,ind1] = max(eqmat(:,:,iter,:),[],2);
    inds(:,iter,:) = reshape(ind1,m,1,q);
    ind2 = squeeze(ind1);
    ind3 = bsxfun(@plus,vec1,(ind2-1)*m); %//' setup forward moving equalities
    ind4 = bsxfun(@plus,ind3,vec2);
    ind5 = bsxfun(@plus,ind4,vec3);
    eqmat(ind5(:)) = 0;
end

p1wom = sum(all(diff(inds,[],2)>0,2),3);
p2wom = squeeze(sum(all(diff(inds,[],2)>0,2),1));

As usual, I would encourage you to use gpuArrays too with your favorite parfor.

Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Sorry for my delay. You are right, for the specified order of magnitude for `M` and `N`, the below usage of `nchoosek` could quickly make the code impossible, e.g. `nchoosek(1:40,20)`. In that regard, your `nchoosek` usage is much better behaved. Actually, I'd also replace it by `VChooseK` as the latter is much, much faster. – M.G. Sep 16 '14 at 10:27
  • @July Didn't know about `VChooseK` until till point as I found it on MATLAB FEX. So, it appears to be MEXed, which is great and yeah might be helpful with performance. I dearly wished to get rid of it. – Divakar Sep 16 '14 at 10:30