
The problem is the following:

I have a cell array of the form indx{jj}, where each cell is a 1xNjj array, meaning they all have different sizes. In my case there are at most 3 cells (max(jj)==3), but let's consider the general case for the sake of it.

How would you find the value(s) repeated in all the indx{jj} in the fastest way?

I can guess how to do it with several for loops, but is there a "one (three?) liner"?

Simple example:

indx{1}=[ 1 3 5 7 9];
indx{2}=[ 2 3 4 1];
indx{3}=[ 1 2 5 3 3 5 4];


ans=[1 3];
Ander Biguri

3 Answers


One possibility is to use a for loop with intersect:

result = indx{1}; %// will be changed
for n = 2:numel(indx)
    result = intersect(result, indx{n});
end
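Run on the sample data from the question, the loop narrows result down step by step (a quick check, reusing the indx cells defined above):

```matlab
indx{1} = [1 3 5 7 9];
indx{2} = [2 3 4 1];
indx{3} = [1 2 5 3 3 5 4];

result = indx{1};                         %// will be changed
for n = 2:numel(indx)
    result = intersect(result, indx{n});  %// keep only values also present in indx{n}
end
%// result is [1 3]
```

Note that intersect also sorts and de-duplicates, so repeated entries within a cell (like the two 3s in indx{3}) are handled automatically.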
Luis Mendo
  • Fantastic, great answer, as usual ;) – Ander Biguri Sep 15 '14 at 15:34
  • 1
    @AnderBiguri Thanks! :-) I was trying to avoid the loop, but if a cell can have repeated elements I don't see how – Luis Mendo Sep 15 '14 at 15:35
  • Sorry, but it seems that @rayryeng's answer is faster! Check a comment there to see benchmark! But, just wow guys! thanks a lot for the effort! – Ander Biguri Sep 15 '14 at 16:17
  • 1
    @AnderBiguri Haha, good competition here, motivated by a good question – Luis Mendo Sep 15 '14 at 16:23
  • 2
    @LuisMendo - Don't worry Luis. It turns out that Divakar's method is faster (as I suspected)... just Ander decided to play a trick on me and got my hopes up ;) hehe. – rayryeng Sep 15 '14 at 16:53
  • 2
    @AnderBiguri - This was a good question to ask. A little competition doesn't hurt! – rayryeng Sep 15 '14 at 16:53
  • So in continuation with our multidimensional mathematical operations thing, I just came across this - http://stackoverflow.com/questions/24508751/summing-a-4d-matrix-without-for-loops-in-matlab. Take a look! – Divakar Sep 15 '14 at 17:54
  • @Divakar Again, `sum` seems to be faster when it works along lowest dimensions, right? – Luis Mendo Sep 15 '14 at 17:58
  • @LuisMendo Okay so shai commented in there this - `"performance-wise, it is better to sum over trailing dimension (i.e., third and forth) rather than the leading dimensions (i.e., first and second)."`. What do you think about it? I would think the point is to avoid "squeeze". In any case, we really need to perform some benchmarks to make it foolproof. – Divakar Sep 15 '14 at 18:03
  • @Divakar Oh, I had only looked at the answers and their comments. But our (my?) conclusions the other day were that it's faster along leading dimensions, not along trailing dimensions, if I recall correctly. Yes, it could be caused by `squeeze`. This would make a good "canonical" Q&A :-) – Luis Mendo Sep 15 '14 at 18:05
  • @LuisMendo If you are okay with keeping the summed output as the input multidimensional array size, I am guessing summing along the lower dims would be better, but in most cases we want it to be "squeezed" in and `squeeze` could slow things a bit. Well, all of these could really form some good canonical/reference stuff. – Divakar Sep 15 '14 at 18:11
  • @LuisMendo - I've got a MATLAB question for you if you don't mind answering :). I'd like your input. This is the first time I think I've asked a serious question where I'm stumped! Check here: http://stackoverflow.com/questions/25877835/efficiently-compute-a-3d-matrix-of-outer-products-matlab – rayryeng Sep 16 '14 at 20:34
  • @rayryeng Looks like Divakar arrived first :-) I'll take a look anyway in case I can add a different view – Luis Mendo Sep 16 '14 at 21:49

Almost no-loop approach (almost, because cellfun essentially uses loop(s) inside, but its effect here is minimal as we are only using it to find the number of elements in each cell) -

lens = cellfun(@numel,indx);   %// number of elements in each cell

%// Logical mask: entry (i,j) is true when cell j has at least i elements
val_ind = bsxfun(@ge,lens,[1:max(lens)]');
vals = horzcat(indx{:});       %// all values concatenated into one row vector

mat1(max(lens),numel(lens))=0; %// fast pre-allocation of a zero matrix
mat1(val_ind) = vals;          %// column j now holds the values of indx{j}

unqvals = unique(vals);
%// keep a value only if every column of mat1 contains it somewhere
out = unqvals(all(any(bsxfun(@eq,mat1,permute(unqvals,[1 3 2]))),2));
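Tracing the steps on the question's example clarifies the shapes involved (a sketch; `zeros` is used in place of the answer's pre-allocation trick purely for readability, and is also safer if the code runs inside a loop where `mat1` could retain stale values):

```matlab
indx{1} = [1 3 5 7 9];
indx{2} = [2 3 4 1];
indx{3} = [1 2 5 3 3 5 4];

lens = cellfun(@numel,indx);                 %// [5 4 7]
val_ind = bsxfun(@ge,lens,[1:max(lens)]');   %// 7x3 logical column mask
vals = horzcat(indx{:});                     %// 1x16 vector of all values

mat1 = zeros(max(lens),numel(lens));         %// 7x3, zero-padded
mat1(val_ind) = vals;                        %// column j holds indx{j}

unqvals = unique(vals);                      %// [1 2 3 4 5 7 9]
%// bsxfun produces a 7x3x7 equality cube; any(...) checks each column,
%// all(...,2) keeps only values found in every column
out = unqvals(all(any(bsxfun(@eq,mat1,permute(unqvals,[1 3 2]))),2));
%// out is [1 3]
```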
Divakar
  • and... Is it really faster? – Ander Biguri Sep 15 '14 at 15:42
  • +1 I like the approach. Just a quibble: `cellfun` is more or less a loop – Luis Mendo Sep 15 '14 at 15:45
  • @LuisMendo I would assume that to be of minimal work here, I guess :) I am just a bit worried about the bsxfun going to 3D. – Divakar Sep 15 '14 at 15:45
  • @AnderBiguri Do you have any large data to perform benchmarks? – Divakar Sep 15 '14 at 15:46
  • This approach looks promising from the point of view of speed. `intersect` doesn't seem to be fast – Luis Mendo Sep 15 '14 at 15:47
  • The thing is: Generally I dont have big data, but I need to use it A LOT of times. I am trying all your approaches, to see who wins! – Ander Biguri Sep 15 '14 at 15:49
  • 1
    @AnderBiguri May sound silly, but just to be sure, make sure to suppress the output `out` here and for other solutions too. Made a slight edit to calculate `lens` and more changes at the end :) – Divakar Sep 15 '14 at 15:54
  • @Divakar +1 from me too. Also, not trying to bug, but was wondering if you happened to take a look at that script yet :). No pressure... just wondering what your thoughts were so far! – rayryeng Sep 15 '14 at 16:05
  • Sorry to say, but your approach fails sometimes. Still trying to see why. ex: indx{1}=[47388 327602 18532 197333 327757 10453 95320 320 50873 189793 189795]'; indx{2}=[6044 150409 153084 327602 10453 73164 73167 73170 73171 95320 113865 189812 190421 197333 197334 327185]';indx{3}=[18379 47388 50816 197344 327757 10027 10035 10040 95327 197316 327602]'; – Ander Biguri Sep 15 '14 at 16:06
  • 1
    @rayryeng Ah No I didn't! ;) Just being lazy about it I guess, but will soon , hopefully ;) – Divakar Sep 15 '14 at 16:06
  • @AnderBiguri let me try that out. – Divakar Sep 15 '14 at 16:15
  • @AnderBiguri I had few issues copying and pasting that data, not sure what it was. So, this code expects each cell to have `1xNjj` data as pointed out in the question. So I had to modify the data accordingly, which you can try too - `indx{1}=[47388 327602 18532 197333 327757 10453 95320 320 50873 189793 189795]; indx{2} = [6044 150409 153084 327602 10453 73164 73167 73170 73171 95320 113865 189812 190421 197333 197334 327185] indx{3}=[18379 47388 50816 197344 327757 10027 10035 10040 95327 197316 327602]` – Divakar Sep 15 '14 at 16:29
  • 1
    @AnderBiguri Could you perform the benchmarks with this data again? Sorry to bother :) – Divakar Sep 15 '14 at 16:33
  • I retried. And well, here you are kicking asses! But a couple of things needed to be changed for a looping case. 1: clear mat1 (mat1=[] at the beginning), else sometimes it will have previous values. 2: in case lens is 0, add an if to skip the code and avoid exceptions. But I guess they are problem-specific improvements. BTW, avg: 2.4e-4. Congrats! – Ander Biguri Sep 15 '14 at 16:47
  • @AnderBiguri That first one was a just a better way to pre-allocate that I got from [here](http://undocumentedmatlab.com/blog/preallocation-performance), but yeah one can replace that with `= zeros(max(lens),numel(lens))` for a not so crazy pre-allocation. Agreed about the second point too! :) Was a good question BTW! – Divakar Sep 15 '14 at 16:50
  • @Divakar - I've got a MATLAB question for you if you don't mind answering :). This is part of that efficiency stuff I was talking to you about. I've made some progress, but I'm experiencing a bottleneck. Check here: http://stackoverflow.com/questions/25877835/efficiently-compute-a-3d-matrix-of-outer-products-matlab – rayryeng Sep 16 '14 at 20:33

Another possibility I could suggest, though Luis Mendo's answer is very good, is to take each vector in your cell array and remove its duplicates. This can be done through cellfun, specifying unique as the function to apply. You have to set the UniformOutput flag to false, as we output a cell array at each index. You also have to be careful that the cells are either all row vectors or all column vectors; you can't mix the way the arrays are shaped or this method won't work.

Once you do this, concatenate all of the vectors into a single array through cell2mat, then do a histogram through histc. You'd specify the edges to be all of the unique numbers in the single array created before. Note that you'd have to make an additional call to unique on the output single array before proceeding. Once you calculate the histogram, for any entries with a bin count equal to the total number of elements in your cell array (which is 3 in your case), then these are values that you see in all of your cells. As such:

A = cell2mat(cellfun(@unique, indx, 'uni', 0)); %// de-duplicate each cell, then concatenate
edge_values = unique(A);                        %// bin edges: every distinct value
h = histc(A, edge_values);                      %// count occurrences of each value
result = edge_values(h == numel(indx));         %// values seen once per cell, i.e. in all cells

With the unique call for each cell array, if a number appears in every single cell, then the total number of times you see this number should equal the total number of cells you have.
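On the question's example this yields the expected result (a quick trace; note that in releases from R2014b onwards, `histcounts` is MathWorks' recommended replacement for `histc`, which is kept here to match the answer):

```matlab
indx{1} = [1 3 5 7 9];
indx{2} = [2 3 4 1];
indx{3} = [1 2 5 3 3 5 4];

A = cell2mat(cellfun(@unique, indx, 'uni', 0)); %// [1 3 5 7 9 1 2 3 4 1 2 3 4 5]
edge_values = unique(A);                        %// [1 2 3 4 5 7 9]
h = histc(A, edge_values);                      %// [3 2 3 2 2 1 1]
result = edge_values(h == numel(indx));         %// [1 3]
```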

rayryeng
  • +1 I was going to type something similar as an edit to my answer – Luis Mendo Sep 15 '14 at 15:43
  • 2
    @LuisMendo - Thanks! :) I wanted to add a no `for` loop approach here as well, though Divakar has already done that... being the `bsxfun` advocate that he is! – rayryeng Sep 15 '14 at 15:44
  • 2
    Who could _not_ be a `bsxfun` advocate? :-) – Luis Mendo Sep 15 '14 at 15:44
  • 1
    @LuisMendo - I slightly changed my answer to make it more robust, as my previous answer assumed that indices started from `1`. Also, the previous approach assumed that the values were integer. The modified code now assumes that the numbers can be any number, and can start anywhere on the real set of numbers. – rayryeng Sep 15 '14 at 16:02
  • 2
    @rayryeng Nice to see the big guys compete for the best solution for the shake of SCIENCE! You are winning in speed! I run the code for Njj of around 30 each 1000 times. Your average =3.7e-4 Luis's average =5.6e-4 – Ander Biguri Sep 15 '14 at 16:16