1

What I'm trying to do

I have an array of numbers:

>> A = [2 2 2 2 1 3 4 4];

And I want to find the array indices where each number can be found:

>> B = arrayfun(@(x) {find(A==x)}, 1:4);

In other words, this B should tell me:

>> for ii=1:4, fprintf('Item %d in location %s\n',ii,num2str(B{ii})); end
Item 1 in location 5
Item 2 in location 1  2  3  4
Item 3 in location 6
Item 4 in location 7  8

It's like the 2nd output argument of unique, but instead of the first (or last) occurrence, I want all the occurrences. I think this is called a reverse lookup (where the original key is the array index), but please correct me if I'm wrong.

How can I do it faster?

What I have above gives the correct answer, but it scales terribly with the number of unique values. For a real problem (where A has 10M elements with 100k unique values), even this stupid for loop is 100x faster:

>> B = cell(max(A),1);
>> for ii=1:numel(A), B{A(ii)}(end+1)=ii; end

But I feel like this can't possibly be the best way to do it.

We can assume that A contains only integers from 1 to the max (because if it doesn't, I can always pass it through unique to make it so).

Community
  • 1
  • 1
KQS
  • 1,547
  • 10
  • 21

2 Answers2

3

That's a simple task for accumarray:

out = accumarray(A(:),(1:numel(A)).',[],@(x) {x})  %'

out{1} = 5
out{2} = 3 4 2 1
out{3} = 6
out{4} = 8 7  

However accumarray suffers from not being stable (in the sense of unique's feature), so you might want to have a look here for a stable version of accumarray, if that's a problem.


Above solution also assumes A to be filled with integers, preferably with no gaps in between. If that is not the case, there is no way around a call of unique in advance:

A = [2.1 2.1 2.1 2.1 1.1 3.1 4.1 4.1];

[~,~,subs] = unique(A)
out = accumarray(subs(:),(1:numel(A)).',[],@(x) {x})

To sum up, the most generic solution, working with floats and returning a sorted output could be:

[~,~,subs] = unique(A)
[subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));  %// optional
vals = 1:numel(A);                                   
vals = vals(I);                                      %// optional
out = accumarray(subs, vals , [],@(x) {x});

out{1} = 5
out{2} = 1 2 3 4
out{3} = 6
out{4} = 7 8  

Benchmark

function [t] = bench()
    %// data
    a = rand(100);
    b = repmat(a,100);
    A = b(randperm(10000));

    %// functions to compare
    fcns = {
        @() thewaywewalk(A(:).');
        @() cst(A(:).');
    }; 

    % timeit
    t = zeros(2,1);
    for ii = 1:100;
        t = t + cellfun(@timeit, fcns);
    end
    format long
end

function out = thewaywewalk(A) 
    [~,~,subs] = unique(A);
    [subs(:,end:-1:1), I] = sortrows(subs(:,end:-1:1));
    idx = 1:numel(A);
    out = accumarray(subs, idx(I), [],@(x) {x});
end
function out = cst(A) 
    [B, IX] = sort(A);
    out  = mat2cell(IX, 1, diff(find(diff([-Inf,B,Inf])~=0)));
end

0.444075509687511  %// thewaywewalk
0.221888202987325  %// CST-Link

Surprisingly the version with stable accumarray is faster than the unstable one, due to the fact that Matlab prefers sorted arrays to work on.

Community
  • 1
  • 1
Robert Seifert
  • 25,078
  • 11
  • 68
  • 113
  • why the sortrows() and not just sort? – Oleg Feb 11 '16 at 20:52
  • @Oleg the explanation is in the linked answer: *We can use sortrows as a preprocessing step to sort the indices and corresponding values first SORTROWS uses a stable version of quicksort* – Robert Seifert Feb 11 '16 at 20:57
  • This also confused me at first, but my understanding is that `sortrows` and the `end:-1:1` column ordering are used in the linked answer to properly handle multi-dimensional subscripts. In this particular case, `subs` is always a vector and so `sort` and `sortrows` are equivalent. – KQS Feb 11 '16 at 21:10
  • @thewaywewalk nice linked answer. That is probably also where performance is lost. I usually use `accumarray()` to `mat2cell()` an array since it is faster, but I do not care about preserving order. – Oleg Feb 11 '16 at 21:26
3

This solution should work in O(N*log(N)) due sorting, but is quite memory intensive (requires 3x the amount of input memory):

[U, X] = sort(A);
    B  = mat2cell(X, 1, diff(find(diff([Inf,U,-Inf])~=0)));

I am curious about the performance though.