9

I'm trying to find a fastest way for finding unique values in a array and to remove 0 as a possibility of unique value.

Right now I have two solutions:

result1 = setxor(0, dataArray(1:end,1)); % This gives the correct solution
result2 = unique(dataArray(1:end,1)); % This solution is faster but doesn't give the same result as result1

dataArray is equivalent to :

dataArray = [0 0; 0 2; 0 4; 0 6; 1 0; 1 2; 1 4; 1 6; 2 0; 2 2; 2 4; 2 6]; % This is a small array, but in my case there are usually over 10 000 lines.

So in this case, result1 is equal to [1; 2] and result2 is equal to [0; 1; 2]. The unique function is faster but I don't want 0 to be considered. Is there a way to do this with unique and not consider 0 as a unique value? Is there an another alternative?

EDIT

I wanted to time the various solutions.

clc
dataArray = floor(10*rand(10e3,10));
dataArray(mod(dataArray(:,1),3)==0)=0;
% Initial
tic
for ii = 1:10000
   FCT1 = setxor(0, dataArray(:,1));
end
toc
% My solution
tic
for ii = 1:10000
   FCT2 = unique(dataArray(dataArray(:,1)>0,1));
end
toc
% Pursuit solution
tic
for ii = 1:10000
   FCT3 = unique(dataArray(:, 1));
   FCT3(FCT3==0) = [];
end
toc
% Pursuit solution with chappjc comment
tic
for ii = 1:10000
   FCT32 = unique(dataArray(:, 1));
   FCT32 = FCT32(FCT32~=0);
end
toc
% chappjc solution
tic
for ii = 1:10000
   FCT4 = setdiff(unique(dataArray(:,1)),0);
end
toc
% chappjc 2nd solution
tic
for ii = 1:10000
   FCT5 = find(accumarray(dataArray(:,1)+1,1))-1;
   FCT5 = FCT5(FCT5>0);
end
toc

And the results:

Elapsed time is 5.153571 seconds. % FCT1 Initial
Elapsed time is 3.837637 seconds. % FCT2 My solution
Elapsed time is 3.464652 seconds. % FCT3 Pursuit solution
Elapsed time is 3.414338 seconds. % FCT32 Pursuit solution with chappjc comment
Elapsed time is 4.097164 seconds. % FCT4 chappjc solution
Elapsed time is 0.936623 seconds. % FCT5 chappjc 2nd solution

However, the solution with sparse and accumarray only works with integer. These solutions won't work with double.

m_power
  • 3,156
  • 5
  • 33
  • 54
  • @chappjc you are right, the `'rows'` is not necessary. I removed it. – m_power Dec 18 '13 at 20:44
  • You might want to try the timings for longer `dataArray` rather than more iterations. But just for kicks, throw in `FCT3 = FCT3(FCT3~=0);` since that is usually faster than assigning `[]`;. – chappjc Dec 18 '13 at 20:48
  • Agree with @chappjc - your timing is dominated by having either one or two function calls. Try this on a "big" array (see my answer for an example). You are mostly timing the loop (and function call) overhead with your current code, rather than the efficiency of the functions once you are inside them. – Floris Dec 18 '13 at 20:54
  • 1
    @chappjc 2nd solution appears to be the fastest, for now :D. – m_power Dec 18 '13 at 21:20
  • @m_power Have a look at Floris' answers, and also his test data commands which remove some elements for a more robust test. If you want, you can also do all 10 columns in one shot with `rand(10e3,10)`. – chappjc Dec 18 '13 at 21:23
  • @chappjc I made the modifications. There are no changes to the final results. – m_power Dec 18 '13 at 21:31
  • @m_power As I mentioned in my comments below, there are challenges with floating point numbers even with `unique`. See [here](http://stackoverflow.com/questions/1988535/return-unique-element-with-a-tolerance) and [here](http://www.mathworks.com/matlabcentral/newsreader/view_thread/12406). Consider if you can operate with integral-valued indexes rather than floating point values. – chappjc Dec 18 '13 at 21:52
  • 1
    @chappjc inspired by your last comment I updated my answer one more time... – Floris Dec 18 '13 at 22:31

4 Answers4

6

Here's a wacky suggestion with accumarray, demonstrated using Floris' test data:

a = floor(10*rand(100000, 1)); a(mod(a,3)==0)=0;
result = find(accumarray(nonzeros(a(:,1))+1,1))-1;

Thanks to Luis Mendo for pointing out that with nonzeros, it is not necessary to perform result = result(result>0)!

Note that this solution requires integer-valued data (not necessarily an integer data type, but just not with decimal components). Comparing floating point values for equality, as unique would do, is perilous. See here and here.


Original suggestion: Combine unique with setdiff:

result = setdiff(unique(a(:,1)),0)

Or remove with logical indexing after unique:

result = unique(a(:,1));
result = result(result>0);

I generally prefer not to assign [] as in (result(result==0)=[];) since it gets very inefficient for large data sets.

Removing zeros after unique should be faster since the it operates on less data (unless every element is unique, OR if a/dataArray is very short).

Community
  • 1
  • 1
chappjc
  • 30,359
  • 6
  • 75
  • 132
  • When using `a` 10000x10 I get the following error : ??? Error using ==> accumarray Maximum variable size allowed by the program is exceeded. – m_power Dec 18 '13 at 21:15
  • 1
    @m_power Needs to use just the first column (`a(:,1)`) like the other solutions. Updated. – chappjc Dec 18 '13 at 21:16
  • 2
    Wacky, maybe. But FAST! – Floris Dec 18 '13 at 21:28
  • What if there are `double` in the `a` array? I get the error `First input SUBS must contain positive integer subscripts.`. – m_power Dec 18 '13 at 21:41
  • @m_power Right, this and the `sparse` solutions only work for integer-valued data, but you generally don't want to compare floating point values for equality (as with `unique` too). I should say, you _really_ shouldn't. – chappjc Dec 18 '13 at 21:42
  • @chappjc The fact is there are `double` in my array, I didn't thought it would cause a problem in my initial example. – m_power Dec 18 '13 at 21:46
  • 2
    @m_power I'd reconsider the problem based on what the expected values are. Are they continuous or will they be certain values. There are [problems using unique on double as well](http://www.mathworks.com/matlabcentral/newsreader/view_thread/12406) although it will let you run it. Beware. – chappjc Dec 18 '13 at 21:48
  • 1
    @chappjc +1, but why not remove the zero elements _before_ `accumarray`?: `result = find(accumarray(nonzeros(a(:,1))+1,1)-1);`. It takes slighlty less time on my computer. – Luis Mendo Jan 01 '14 at 04:19
5

Just to add to the general clamor - here are three different methods. They all give the same answer, but slightly different timings:

a = floor(10*rand(100000, 1));
a(mod(a,3)==0)=0;
tic
b1 = unique(a(:,1));
b1(b1==0) = [];
toc
tic
b2 = find(sparse(a(:,1)+1, 1, 1)) - 1;
b2(b2==0)=[];
toc
tic
b3 = setxor(0, a(:, 1), 'rows');
toc

display(b1)
display(b2)
display(b3)

On my machine, the timings (for an array of 100000 elements) were as follows:

0.0087 s  - for unique
0.0142 s  - for find(sparse)
0.0302 s  = for setxor

I always like sparse for a problem like this - you get the count of elements at the same time as their unique values.

EDIT per @chappj's suggestion. I added a fourth option

b4 = find(accumarray(a(:,1)+1,1)-1);
b4(b4==0) = [];

Time:

0.0029 s , THREE TIMES FASTER THAN UNIQUE

Ladies and gentlemen, we have a winner.

AFTERWORD the index-based methods (sparse and accumarray) only work with integer-valued inputs (although they can be of double type). This seemed OK based on the input array given in the question, but doesn't work for non-integer valued inputs. Of course, unique is a tricky concept when you have doubles - number that "look" the same may be represented differently. You might consider truncating the input array (sanitizing the data) to make sure this is not a problem. For example, if you did

a = 0.001 * double(int(a * 1000));

You would round all values to no more than 3 significant figures, and because you went "via an int" you are sure that you don't end up with values that are "very subtly different" (say in the 8th digit or beyond). Of course in that case you could also do

a = round(a * 1000);
mina = min(a(:));
b = find(accumarray(a - mina + 1, 1)) + mina - 1;
b = 0.001 * b(b ~= 0);

This is "fairly robust" for non-integer values (in the above case it handles values with up to three significant digits; if you need more, the space requirements will eventually get too large and this method will be slower than unique, which in fact has to sort the data.)

Floris
  • 45,857
  • 6
  • 70
  • 122
  • 1
    I prefer `accumarray` to `sparse`. Wanna add the timings. ;) – chappjc Dec 18 '13 at 21:06
  • @chappjc that was a terrific suggestion. It is 3x faster (on my machine)! – Floris Dec 18 '13 at 21:28
  • 1
    Thank you, thank you. But it's likely only a matter of time until a faster one comes along. ;) – chappjc Dec 18 '13 at 21:28
  • Your third solution is working correctly, however I can't use it for what I'm doing. I need to have same exact values, not rounded values. – m_power Dec 19 '13 at 16:06
  • 1
    @m_power Could you post some actual data on dropbox or somewhere similar. I'm skeptical that you will get a satisfactory solution with anything that does not use tolerances, including `unique`. Testing floating point values for equality rarely works as desired. – chappjc Dec 19 '13 at 16:44
  • @Floris As I just commented in chappjc's answer: why not remove the zero elements _initially_? In the `accumarray` version, that seems to save a little time (on my machine): `b5 = find(accumarray(nonzeros(a(:,1))+1,1)-1)` – Luis Mendo Jan 01 '14 at 04:23
3

Why not remove the zeros as a second step:

result2 = unique(.....);
result2 = (result2~=0);
Pursuit
  • 12,285
  • 1
  • 25
  • 41
  • `result2 = result2(result2 ~= 0)` might be faster: http://stackoverflow.com/questions/12421345/deleting-matrix-elements-by-vs-reassigning-matrix – Dan Dec 19 '13 at 07:57
0

I also found another way to do it :

result2 = unique(dataArray(dataArray(:,1)>0,1));
m_power
  • 3,156
  • 5
  • 33
  • 54
  • 1
    Just a tip: `1:end` is the same as just `:`. But, yeah, this is a workable solution. – chappjc Dec 18 '13 at 20:44
  • 1
    Strictly speaking you should use `dataArray(:,1)~=0` to allow for the possibility of negative numbers. I suspect thought that it's faster to remove the extra zero at the end (smaller array to manipulate). – Floris Dec 18 '13 at 22:56