8

The uniquetol function, introduced in R2015a, computes "unique elements within tolerance". Specifically,

C = uniquetol(A,tol) returns the unique elements in A using tolerance tol.

But the problem of finding unique elements with a given tolerance has several solutions. Which one is actually produced?

Let's see two examples:

  1. Let A = [3 5 7 9] with absolute tolerance 2.5. The output can be [3 7], or it can be [5 9]. Both solutions satisfy the requirement.

  2. For A = [1 3 5 7 9] with absolute tolerance 2.5, the output can be [1 5 9] or [3 7]. So even the number of elements in the output can vary.

See this nice discussion about the transitivity issue that lies at the heart of the problem.

So, how does uniquetol work? What output does it produce among the several existing solutions?

Luis Mendo
  • 110,752
  • 13
  • 76
  • 147
  • Why `reverse-engineering` tag? – arrowd Sep 07 '17 at 16:54
  • 1
    @arrowd because he reverse-engineered the fucntion – Ander Biguri Sep 07 '17 at 16:54
  • @arrowd To be honest, I'm not totally sure it applies here. I think it does, as the description says _Reverse engineering is the process of discovering the technological principles of a human made [...] object or system through analysis of its [...] function and operation._ – Luis Mendo Sep 07 '17 at 16:56

1 Answers1

8

To simplify, I consider the one-output, two-input version of uniquetol,

 C = uniquetol(A, tol);

where the first input is a double vector A. In particular, this implies that:

  • The 'ByRows' option of uniquetol is not used.
  • The first input is a vector. If it were not, uniquetol would implicitly linearize to a column, as usual.
  • The second input, which defines the tolerance, is interpreted as follows:

    Two values, u and v, are within tolerance if abs(u-v) <= tol*max(abs(A(:)))

    That is, the specified tolerance is relative by default. The actual tolerance used in the comparisons is obtained by scaling by the maximum absolute value in A.

With these considerations, it seems that the approach that uniquetol uses is:

  1. Sort A.
  2. Pick the first entry of sorted A, and set this as reference value (this value will have to be updated later).
  3. Write the reference value into the output C.
  4. Skip subsequent entries of sorted A until one is found that is not within tolerance of the reference value. When that entry is found, take it as the new reference value and go back to step 3.

Of course, I'm not saying that this is what uniquetol internally does. But the output seems to be the same. So this is functionally equivalent to what uniquetol does.

The following code implements the approach described above (inefficient code, just to illustrate the point).

% Inputs A, tol
% Output C
tol_scaled = tol*max(abs(A(:))); % scale tolerance
C = []; % initiallize output. Will be extended
ref = NaN; % initiallize reference value to NaN. This will immediately cause
           % A(1) to become the new reference
for a = sort(A(:)).';
    if ~(a-ref <= tol_scaled)
        ref = a;
        C(end+1) = ref;
    end
end

To verify this, let's generate some random data and compare the output of uniquetol and of the above code:

clear
N = 1e3; % number of realizations
S = 1e5; % maximum input size
for n = 1:N;
    % Generate inputs:
    s = randi(S); % input size
    A = (2*rand(1,S)-1) / rand; % random input of length S; positive and 
                                % negative values; random scaling
    tol = .1*rand; % random tolerance (relative). Change value .1 as desired

    % Compute output:
    tol_scaled = tol*max(abs(A(:))); % scale tolerance
    C = []; % initiallize output. Will be extended
    ref = NaN; % initiallize reference value to NaN. This will immediately cause
               % A(1) to become the new reference 
    for a = sort(A(:)).';
        if ~(a-ref <= tol_scaled)
            ref = a;
            C(end+1) = ref;
        end
    end

    % Check if output is equal to that of uniquetol:
    assert(isequal(C, uniquetol(A, tol)))
end

In all my tests this has run without the assertion failing.

So, in summary, uniquetol seems to sort the input, pick its first entry, and keep skipping entries for as long as it can.

For the two examples in the question, the outputs are as follows. Note that the second input is specified as 2.5/9, where 9 is the maximum of the first input, to achieve an absolute tolerance of 2.5:

>> uniquetol([1 3 5 7 9], 2.5/9)
ans =
     1     5     9
>> uniquetol([3 5 7 9], 2.5/9)
ans =
     3     7
Luis Mendo
  • 110,752
  • 13
  • 76
  • 147
  • 1
    So, for the examples in the question, which is the output? (+1) – Ander Biguri Sep 07 '17 at 16:54
  • 1
    @AnderBiguri Good point, I'll include that. It basically picks the first value it finds, then skips values until it can skip no more. So: `uniquetol([3 5 7 9], 2.5/9)` gives `[3 7]`; `uniquetol([1 3 5 7 9], 2.5/9)` gives `[1 5 9]`. Note the required `/9` so `2.5` is _absolute_ tolerance – Luis Mendo Sep 07 '17 at 16:59
  • Interesting! I would instead output the mean or the median of each set of equivalent values, which I think would be fairer somehow. – Cris Luengo Aug 17 '19 at 18:30
  • @CrisLuengo But would you still define _a set of equivalent values_ as is done by `uniquetol`? The seems to be a large degree of arbitrariness in the way it is done. (And thanks for the votes!) – Luis Mendo Aug 17 '19 at 18:53
  • Maybe an iterative process could lead to a more stable result? – Cris Luengo Aug 17 '19 at 20:35