
I have a set of integers, say S = {1,...,10}, and two matrices N and M whose rows are some (but not necessarily all possible) permutations of elements from S of lengths, say, 3 and 5 respectively, e.g. N = [1 2 3; 2 5 3; ...], M = [1 2 3 4 5; 2 4 7 8 1; ...].

A sub-permutation Q of a permutation P is just an indexed subset of P such that the order of the indices of the elements of Q is the same as the order of their indices in P. Example: [2,4,7] is a sub-permutation of [2,3,4,6,7,1], but [1,2,3] is not a sub-permutation of the latter.
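
In MATLAB terms, the check for a single pair of rows can be sketched as follows (just an illustration of the definition via `ismember`, not my actual code):

```matlab
%// q is a sub-permutation of p iff every element of q occurs in p
%// and the positions of those occurrences are strictly increasing
p = [2 3 4 6 7 1];
q = [2 4 7];
[found, pos] = ismember(q, p);            %// pos = positions of q's elements in p
isSub = all(found) && all(diff(pos) > 0)  %// logical 1 here
```

For q = [1 2 3] the same check gives logical 0, since pos comes out as [6 1 2].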

I need an efficient way (as vectorized as possible, with as few for-loops as possible) to find

(1) all permutations from M which have sub-permutations from N

and

(2) how many times each sub-permutation from N is to be found in M.

So far, what I have is vectorized code that checks whether a given single sub-permutation is contained in M (and how many times), but then I have to use a parfor-loop through N, which gets slow for very big N. Note that if N is not too big, one could also attack the problem by simply constructing the admissible 5-tuples from the given 3-tuples and comparing the result to M, but that can quickly become much slower than simple brute force if N is sufficiently big.

An alternative way to view the problem is as follows: check if N modulo permutations of its rows is a sub-matrix of M in a general sense, i.e. if it is possible to obtain a permutation of the rows of N by deleting elements from M.

Apologies if my question is too elementary, my background is from arithmetic algebraic geometry and representation theory and I am very new to MATLAB.

EDIT: Here is my code for checking the presence of a single k-tuple in M:

function [A,f] = my_function(x,M)
%// returns all rows in M that contain x and the absolute frequency of x in M
%// suboptimal for checking combinations rather than permutations, by at least ~50%
k = size(x,2);
m = size(M,1);
R = zeros(m,k);
I = R;
Z = I;
    for j = 1:k   
        [R(:,j),I(:,j)] = max((M == x(j)),[],2); 
        Z(:,j) = R(:,j).*I(:,j);
    end
z = zeros(m,k-1);
    for j = 1:(k-1)
        z(:,j) = (Z(:,j) > 0 & Z(:,j) < Z(:,j+1)); 
    end
[v,~] = find(sum(z,2) == k-1);    
A = M(v,:);
f = length(v);
end

Using this function, checking through N is then just a simple matter of a (par)for-loop, which I was hoping to avoid in favor of a faster vectorized solution.
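
For reference, a toy call on the small matrices from the comments (just a usage illustration):

```matlab
%// Toy usage of my_function:
N = [1 2 3; 2 5 3];
M = [1 2 3 4 5; 2 4 7 8 1];
[A1,f1] = my_function(N(1,:), M)  %// A1 = [1 2 3 4 5], f1 = 1
[A2,f2] = my_function(N(2,:), M)  %// A2 is empty, f2 = 0
```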

M.G.
  • Would a complexity of O( rows(N) * elements(M) ) be satisfactory? – Abhishek Bansal Aug 19 '14 at 12:59
  • Depends on the implicit constant :) What do you have in mind? For a single row from N, my Matlab code for finding it in M is already vectorized, meaning a multitude of parallel execution, so I believe elements(M) per se is a bit too big (though it still could be O(elements(M)), but I am not familiar with the internals of Matlab and even less so with their complexity). On the other hand, a (par)for-loop through N will do like (rows(N)/logical_cores)-checks. My question is really about vectorization of the N-side of the problem since that should be supposedly faster than a (par)for-loop. – M.G. Aug 19 '14 at 13:30
  • `parfor` may not be able to give you a big speed-up, since the parallelized operation itself might be so short that you cannot get past the overhead. – Jonas Aug 19 '14 at 13:33
  • @July My solution is simply to take a row from N and M each and check whether Rn is a sub-permutation of Rm. This can be done by examining each element of Rm. So for the entire two matrices, the complexity shall be O( rows(N) * elements(M) ) – Abhishek Bansal Aug 19 '14 at 13:36
  • If you have a working code that solves this problem, could you share it? Also, would you be okay with using `gpuArrays`? – Divakar Aug 19 '14 at 14:15
  • @user1990169: I see. That would be too slow though. I know that for a fact because that was my first solution and it took 3 days to finish. After vectorization of the code part that checks presence of a given triple in the whole of M via matrix operations on M rather than comparing it with each row of M, the check completed in less than 30 minutes. – M.G. Aug 19 '14 at 14:27
  • @Jonas: I am yet to benchmark 'for' vs 'parfor' in that specific problem with tic-toc. However, benchmarks on similar problems have shown clear advantage of 'parfor'. That might well be due to the age of my workstation :) – M.G. Aug 19 '14 at 14:29
  • @Divakar: ok, np, I will post it as edit to my original post as soon as I get back from the faculty. Yes, I tried gpuArrays, but it was much slower on my workstation, because it caused an immense amount of interrupts and DPCs under Windows (card is GTX 460). I don't know if it is a Windows problem, a driver problem, or something else. I am yet to try and run the CUDA-assisted code under Linux. – M.G. Aug 19 '14 at 14:34
  • Ok just to get started, can I assume `N = [1 2 3; 2 5 3], M = [1 2 3 4 5; 2 4 7 8 1]` and that the solutions to (1) would be `[1 2 3 4 5] (first row only from M)` and (2) would be `[1 0]`? I hope I am assuming the right stuffs here. – Divakar Aug 19 '14 at 14:37
  • Yes, that is correct. – M.G. Aug 19 '14 at 14:43
  • Curious - What typical sizes are we talking about for `M` and `N`? – Divakar Aug 19 '14 at 17:43
  • Typically, M has a few million rows and up to 10 columns, while N has a few thousands rows and up to (10-1) columns. – M.G. Aug 20 '14 at 12:10

3 Answers


Approach #1

[val,ind] = max(bsxfun(@eq,permute(M,[4 2 1 3]),permute(N,[2 3 4 1])),[],2)
matches = squeeze(all(diff(ind,1)>0,1).*all(val,1))
out1 = any(matches,2) %// Solution - 1
out2 = sum(matches,1) %// Solution - 2
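
To unpack the indexing, here is a dimension walk-through of Approach #1 on the toy data from the comments (an annotation of the code above, not additional functionality):

```matlab
M = [1 2 3 4 5; 2 4 7 8 1];   %// rM x cM = 2 x 5
N = [1 2 3; 2 5 3];           %// rN x cN = 2 x 3
E = bsxfun(@eq, permute(M,[4 2 1 3]), permute(N,[2 3 4 1]));
%// E is cN x cM x rM x rN, with E(i,j,r,s) testing N(s,i) == M(r,j)
[val,ind] = max(E,[],2);      %// val(i,1,r,s): is N(s,i) present in row r of M?
                              %// ind(i,1,r,s): at which column of M
matches = squeeze(all(diff(ind,1)>0,1).*all(val,1))
%// matches = [1 0; 0 0] -- only N's 1st row is a sub-permutation of M's 1st row
```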

Approach #2

Another approach that avoids permuting N and might be better for longish N -

[val,ind] = max(bsxfun(@eq,N,permute(M,[3 4 1 2])),[],4)
matches = squeeze(all(diff(ind,[],2)>0,2).*all(val,2))
out1 = any(matches,1) %// Solution - 1
out2 = sum(matches,2) %// Solution - 2

Approach #3

Memory-scroogey approach for large datasizes -

out1 = false(size(M,1),1);  %// Storage for Solution - 1
out2 = zeros(size(N,1),1);  %// Storage for Solution - 2
for k=1:size(N,1)
    [val3,ind3] = max(bsxfun(@eq,N(k,:),permute(M,[1 3 2])),[],3);
    matches = all(diff(ind3,[],2)>0,2).*all(val3,2);
    out1 = or(out1,matches);
    out2(k) = sum(matches);
end

Approach #4

Memory-scroogey approach for GPU -

gM = gpuArray(M);
gN = gpuArray(N);

gout1 = false(size(gM,1),1,'gpuArray');  %// GPU Storage for Solution - 1
gout2 = zeros(size(gN,1),1,'gpuArray');  %// GPU Storage for Solution - 2
for k=1:size(gN,1)
    [val3,ind3] = max(bsxfun(@eq,gN(k,:),permute(gM,[1 3 2])),[],3);
    matches = all(diff(ind3,[],2)>0,2).*all(val3,2);
    gout1 = or(gout1,matches);
    gout2(k) = sum(matches);
end
out1 = gather(gout1);  %// Solution - 1
out2 = gather(gout2);  %// Solution - 2

Now, this GPU approach has blown away all the other approaches. It was run with M : 320000x5 and N : 2100x3 (same as your input sizes) filled with random integers. With a GTX 750 Ti, it took just 13.867873 seconds! So if you have a GPU with sufficient memory, this might be your winner approach too.


Approach #5

Extremely-memory-scroogey approach for GPU -

gM = gpuArray(M);
gN = gpuArray(N);

gout1 = false(size(gM,1),1,'gpuArray');  %// GPU Storage for Solution - 1
gout2 = zeros(size(gN,1),1,'gpuArray');  %// GPU Storage for Solution - 2
for k=1:size(gN,1)
    [val2,ind2] = max(bsxfun(@eq,gM,permute(gN(k,:),[1 3 2])),[],2);
    matches = all(diff(ind2,[],3)>0,3).*all(val2,3);
    gout1 = or(gout1,matches);
    gout2(k) = sum(matches);
end
out1 = gather(gout1);  %// Solution - 1
out2 = gather(gout2);  %// Solution - 2
Divakar
  • Oh I see now (after much effort!). Very clever how `diff` automatically takes care of zeros. – Luis Mendo Aug 19 '14 at 15:57
  • @LuisMendo That `cols+diff` thing works as to select the progressive matches only. I started with `tril` and ended up with this, as the idea is "tril-ish", but the "diagonal" position varies for which I have `cols`. – Divakar Aug 19 '14 at 16:19
  • @Divakar: thank you a lot for your patience with me and for providing all these different approaches! It is much appreciated! I have decided to select your post as an answer, since it not only provided a loop-less solution, but eventually also provided a solution both faster and less memory-hungry (and using a loop on top of that!). I can only wish to master Matlab to the level you have! (continued) – M.G. Aug 20 '14 at 18:13
  • (continuation) Now I need to get an understanding of your code, in particular about the trick with permute(). Is there any explanation why Approach 3 is so much faster than Approach 2 despite looping through all sub-permutations of N? – M.G. Aug 20 '14 at 18:17
  • @July First off, it's been a pleasure working on this interesting problem as it's a sorta rare thing on Stackoverflow! :) Regarding your question, approach #2 is a 100% brute-force memory-hungry approach and as such depends a lot on system memory bandwidth and therefore with large datasizes it gets slower and that's where approach #3 shines (which is still very much a brute force approach but lesser in degree than approach #2). – Divakar Aug 20 '14 at 18:25
  • @July That `permute` is a nice trick to create singleton dimensions that lets `bsxfun` expand the elements onto those singleton dimensions creating a HUGE multi-dimensional array, so that brute-force techniques could be exploited there. I would advise you on checking out the doc on `bsxfun`. It’s an amazing tool for vectorization with MATLAB. – Divakar Aug 20 '14 at 18:27
  • @July Feel free to jump in with any queries on vectorization, would try my best. Finally few Stackoverflow solutions that seemed related and that might benefit you - [Removing four nested loops in Matlab](http://stackoverflow.com/questions/24039297/removing-four-nested-loops-in-matlab) [bsxfun implementation in matrix multiplication](http://stackoverflow.com/questions/23807960/bsxfun-implementation-in-matrix-multiplication) [Vectorizing nested loops in matlab using bsxfun and with GPU](http://stackoverflow.com/questions/25019207/vectorizing-nested-loops-in-matlab-using-bsxfun-and-with-gpu) – Divakar Aug 20 '14 at 18:32
  • @July Just to finish off on the discussion on why Approach #3 is looking better than #2 - Run the tests with M as `100x6` and N as `100x3`, you would see approach #2 winning against #3, but as you would increase the datasizes, #3 would start shining, the reason being memory bandwidth as discussed earlier in the comments. – Divakar Aug 20 '14 at 19:15
  • Thanks for adding Approach 5! It definitely works much better with large number of rows! The performance improvements from the GPU are really astonishing! Sorry if I am being a little dense at the moment, but is there a way to add an option to your code s.t. it computes for each row of M how many elements from N it contains (so to speak, problem 2 with reversed roles for N and M) ? Note that, trivially, it contains problem 1 as a special case. – M.G. Aug 21 '14 at 19:41
  • @July If I assume `M = [1 2 3 4 5; 2 4 7 8 1; 1 2 3 4 8;7 1 2 3 4;1 2 3 4 9;2 3 4 8 9], N = [1 2 3 4; 2 3 4 5]`, then output must be `[2 0 1 1 1 0]`? And for this new case you only need this and not a corresponding `problem 1`? – Divakar Aug 23 '14 at 12:01
  • @Divakar: Yes to both questions. I am just trying to see related problems, that I have been thinking about, implemented "properly" in Matlab in order to better understand the technique, so I can do similar problems later on on my own. If you want, I could also post it as a new question on its own if it's deemed sufficiently distinct and appropriate. – M.G. Aug 24 '14 at 05:06
  • @July And we are talking about 1 Mil and 5 Mil cases for M again? Because the methods that could be suggested would depend on the sizes. – Divakar Aug 24 '14 at 05:09
  • 5 Mil. But now that you mention it, how much further can one actually "optimize"? I mean, what if I had 10 or 20 or even 100 Mil? Is there a way to compute the memory limits and requirements? Sorry if I ask too many questions at once, but when I started two weeks ago, I did not expect to run so quickly into all these limitations. – M.G. Aug 24 '14 at 05:36
  • @July Ok before you start thinking of 100 Mil cases, can you actually create such big arrays in MATLAB? I think you should check on that first. I can barely reach upto `5 Mil x 6` case when I tried to create a random array of that size, not that I have a 32GB RAM rig or anything close to it though. – Divakar Aug 24 '14 at 05:38
  • I was talking rather hypothetically :) I realize, Matlab has an upper limit for arrays and so on, but apparently I do struggle with my 4 GB of RAM (yeah, I know, what was I thinking...), so I am curious in general how can one determine useful upper bounds. In reality, I have estimated that I won't need more than 14-15 Mil rows when it comes to the actual analysis at some point in the future, by which time I also hope to have mastered Matlab sufficiently well :) – M.G. Aug 24 '14 at 06:15
  • @July If you got time, use this to solve and maybe benchmark the latest problem that you were talking about earlier - `gout = zeros(size(gM,1),1,'gpuArray'); %// Storage for Solution for k=1:size(gN,1) [val,ind] = max(bsxfun(@eq,gM,permute(gN(k,:),[1 3 2])),[],2); gout = gout + all(diff(ind,[],3)>0,3).*all(val,3); end out = gather(gout);`. Use `gM` and `gN` from approach #5. Please remember this is made to handle big datasizes. – Divakar Aug 24 '14 at 06:24
  • @Divakar: Thanks a lot! I will try it and report back tomorrow. – M.G. Aug 26 '14 at 21:40
  • @Divakar: Sorry for responding so late! I really got carried away by work, students, deadlines last two weeks.. Anyway, I tested it and it works like a charm! Thank you! However, I noticed that it does not work properly if we allow the possibility for some entries to be identical in a row, i.e. if we allow (sub-)permutations with repetitions. Is there a way to modify the code to enable this? – M.G. Sep 09 '14 at 16:55
  • @July Well your question said `"N and M, whose rows are some (but not necessarily all possible) permutations of elements from S"`, so the solution(s) posted here assume unique entries in a row. If you have to allow repetitions in a row, that would change the problem drastically I am afraid and thus, the solution(s) to solve such a problem too. – Divakar Sep 09 '14 at 17:12
  • @Divakar: you are right, of course, I realize the original problem statement was quite different. I was simply experimenting and trying to come up with an easy modif. of your code for this new case, but I could not, hence the question. Time for a new thread :) – M.G. Sep 10 '14 at 04:57
  • @July Yeah you are on right course. I think starting a new thread would make more sense. – Divakar Sep 10 '14 at 05:06
  • @July BTW as I was coming up on these different solutions to the same problem and you might have noticed it too that different approaches result in different performance results and which also depend on the problem sizes. Looking back, I think it has good potential to form a good blog post. Not trying to give hopes to anyone or even myself on how enriching would that be in terms of knowledge sharing, but I could see its potential. I guess I need your approval on that. I don't have any plans to do such a thing anytime sooner, but just checking by you for now and hear your thoughts on that. – Divakar Sep 10 '14 at 05:09
  • @Divakar: OK, I started a new thread. About the blog: I think it is a great idea, and if you are really interested in that, I have several more combinatorial problems of similar nature that just beg for Matlab vectorization and GPU, and I could also generate data for various benchmarking and comparisons. – M.G. Sep 10 '14 at 08:23
  • @July So I was thinking that if you have more of such combinatorial problem(s) and input data ready, I could take a look in my freetime. Let me know if you do! – Divakar Sep 24 '14 at 20:34
  • @Divakar: Thanks for the kind offer! Indeed I do have more of these problems coming, but I am not quite there yet, so I don't have accurate formulations and I don't really have thought about them at all (now that I have a better understanding of `bsxfun()`, `permute()` and most of the other vectorization functions), nor have I any (large) input data ready at the moment so as to test performance. On the other hand, I guess it would not be too difficult to come up with some more or less common combinatorial problems that I will most certainly meet at some point. Let me know what you think. – M.G. Oct 05 '14 at 04:39
  • @July Sure whatever whenever comes your way, let me or us(Stackoverflow) know (in case I am not around as other people around here are knowledgeable about vectorization too) . It's great to know that you have a better grasp of those functions which I think are key tools to vectorization. Good luck! – Divakar Oct 05 '14 at 06:25

How about this?

n = size(N,1);
m = size(M,1);
p = size(N,2);
pattern = (1:p).'; %'// will be used for checking if it's a subpermutation or not
result = false(m,n); %// preallocate result, and initiallize to 0
for k = 1:size(N,1) %// loop over rows of N
    [~, loc] = ismember(M, N(k,:)); %// entries of M equal to a value of N(k,:)
    ind = find(sum(loc>0,2)==p); %// we need p matches per row of M
    t = reshape(nonzeros(loc(ind,:).'),p,[]); %'// for those rows, take matches
    ind2 = all(bsxfun(@eq,t,pattern)); %// ... see if they are in the right order
    result(ind(ind2),k) = true; %// ... and if so, take note in result matrix
end

The result matrix contains a 1 at position (r,s) if the s-th row of N is a sub-permutation of the r-th row of M. From this, your desired results are

result1 = any(result,2);
result2 = sum(result,1);

Example:

M =
     8     9     4     1    10
     6     5     2     7     8
     4     1     9     2    10
     3     4     5     1     2

N =
     4     1     2
     4     9    10
     3     5     9

give

result =
     0     0     0
     0     0     0
     1     1     0
     1     0     0

result1 =
     0
     0
     1
     1

result2 =
     2     1     0
Luis Mendo
  • I was hoping to avoid for-looping through the sub-permutations of N. I tried it and got out-of-memory problems with it too, though not as bad as with Divakar's solution. I am curious to bench your code vs. mine, though :) Also, mine is not that memory-hungry, I have not run into memory issues with it so far in my tests. – M.G. Aug 20 '14 at 12:17
  • The usual memory-time tradeoff, I guess. If you do some benchmarking, you could post it as an answer; it would be interesting – Luis Mendo Aug 20 '14 at 12:23
  • I wish I had more RAM :( OK, I will try and post some benchmarks with suitable matrices. – M.G. Aug 20 '14 at 14:27

I benchmarked all the approaches against different pairs of matrices N and M, and wherever possible I also compared parfor vs. for and picked the faster one. Here are my results:

    %//Test 1: size(N) = 2263x3, size(M) = 5000x6

    %//My approach (parfor): 0.650626 sec
    %//Divakar's Approach 1: 1.870144 sec
    %//Divakar's Approach 2: 1.164088 sec
    %//Divakar's Approach 3: 0.380915 sec (with parfor)
    %//Divakar's Approach 4: 2.643659 sec (gpu)
    %//Luis Mendo's Approach: 1.681007 sec


    %//Test 2: size(N) = 2263x3, size(M) = 25000x6

    %//My approach (parfor): 6.137823 sec
    %//Divakar's Approach 1: 8.342699 sec
    %//Divakar's Approach 2: 5.784426 sec
    %//Divakar's Approach 3: 2.251888 sec (with parfor)
    %//Divakar's Approach 4: 6.272578 sec (gpu)
    %//Luis Mendo's Approach: 11.514548 sec

    %//Test 3: size(N) = 2100x3, size(M) = 20000x5

    %//My approach (parfor): 3.417432 sec
    %//Divakar's Approach 1: 5.732680 sec
    %//Divakar's Approach 2: 4.107909 sec
    %//Divakar's Approach 3: 1.393052 sec (with parfor)
    %//Divakar's Approach 4: 3.145183 sec (gpu)
    %//Luis Mendo's Approach: 5.668326 sec

    %//Test 4: size(N) = 2100x3, size(M) = 324632x5

    %//Divakar's Approach 3: 54.231878 sec (with parfor)
    %//Divakar's Approach 4: 15.111936 sec (gpu) 

    %//Test 5: size(N) = 2263x3, size(M) = 1000000x6

    %//Divakar's Approach 3: 210.853515 sec (with parfor)
    %//Divakar's Approach 4: 49.529794 sec (gpu) 
    %//Divakar's Approach 5: 49.874444 sec (gpu)

    %//Test 6: size(N) = 2263x3, size(M) = 5000000x6

    %//Divakar's Approach 3: 1137.606244 sec (with parfor)
    %//Divakar's Approach 4: stopped it after 15 min and heavy interrupts/DPCs activity
    %//Divakar's Approach 5: 267.169307 sec

Among the non-gpu approaches, Divakar's Approach 3 was by far the fastest. Its gpu counterpart started showing its advantages only with a large number of rows.

M.G.
  • So, it's not bad I guess going from 30 minutes to 49 secs, huh? :) – Divakar Aug 20 '14 at 18:34
  • Not bad at all :) However, I ran into a small problem. When I up the number of rows of M from 1 Mil to, say, 5 Mil, I suddenly start getting lots of interrupts/DPCs activity in Windows and the gpu-assisted code becomes much slower than its non-gpu counterpart. Something similar is observed when the number of rows of M is too small like in Tests 1-3. My gpu is a GTX 460 with 1 GB of VRAM. Is this something related to the gpu memory limitation or failure of cpu to keep up with the gpu or something else? I could divide M into pieces, but I'd rather avoid that for convenience reasons :) – M.G. Aug 21 '14 at 01:43
  • The reason why #1, 2, 3 with GPU seem slower, AFAIK (from my CUDA knowledge), is that there is a good amount of overhead from CUDA context creation + GPU memory allocation + CPU-to-GPU data transfer. Therefore use the GPU only when you have lots of calculations to perform, which is the case with larger datasizes. To get the actual time spent by GPU calculations, time it without `gM = gpuArray(M); gN = gpuArray(N);` – Divakar Aug 21 '14 at 05:18
  • If you want to keep on increasing M, you need to read each row of M instead and as such the entire code would change for a fair bit and yes it would get a lot more complicated, because then you would be needed to store intermediate results. – Divakar Aug 21 '14 at 05:24
  • Added approach #5 which must be better than #4 for extremely large datasizes. To answer your question of gpu code getting slower than non-gpu code for 5 Mil, the reason behind that has to be memory bandwidth. So for that case, as said earlier too - either restructure your code entirely to read one row at a time from `M` (a lot more inconvenient and complicated, and doesn't guarantee a speedup) or I guess use a better GPU :) – Divakar Aug 21 '14 at 06:10
  • Thanks for the clarifications once again! I guess I do need a better GPU, don't I :) Say, if I add a 2nd CUDA-capable GPU, is it a simple task for Matlab to utilize both at the same time (e.g. if I make a SLI) or it becomes rather involved to utilize both of them efficiently? – M.G. Aug 21 '14 at 19:34
  • Just a quick comment on the 5 mil case: I was able to try it on a GTX 750 Ti with 2GB VRAM and it ran in 150-something secs, which still has to be better than the `parfor` implementation, I am guessing. Could you report the runtimes at your end with the 5 mil case for maybe approaches #4, 5? – Divakar Aug 21 '14 at 21:08
  • @Divakar: Done. I have updated Test 5 since it can provide comparison between your last two gpu approaches. I ran them a few times and they were always very close to each other, sometimes the one being slightly faster, sometimes the other way around. So, I guess, with 1 Mil rows it does not matter that much. In Test 6 (5 Mil rows), I had to manually interrupt Approach 4 after 15 min or so, it was obvious it was not coping well and it was not going anywhere. Approach 5 performed very nicely as expected. – M.G. Aug 22 '14 at 13:45
  • If it's not too troubling, could you add the runtime with parfor implementation for Test 6? I think you said you ran it and was faster than GPU (maybe you meant approach #4)? – Divakar Aug 22 '14 at 13:52
  • @Divakar: Ah, sorry, yes, added now. Sorry it took me a while, I had to run it once again as I did not remember the exact number. There is a new entry now under Test 6. – M.G. Aug 22 '14 at 15:09