Fast way to extract the frequency of elements in a vector grouped by indices in another vector?

Question

Say I have a matrix A whose first column contains some item IDs and second column contains either 0 or 1.

I want to find which item ID has the most 1's, and extract its entries from A, one by one. So, the way I do it is, I make a matrix B to extract all the 1 entries from A, find the most frequently occurring item ID, freq_item{1}, in B, and then extract all entries from A of that ID. Then, remove all instances of the most frequent item and search for the next most frequent item. If 2 or more items have the same number of 1's, choose the one with the bigger ratio of 1's:

B = A(A(:,2)==1,:);
for i=1:size(unique(A(:,1)),1)
   freq_item{i} = A(A(:,1)==mode(B(:,1)),:);
   B = B(B(:,1)~=mode(B(:,1)),:);
end

So, the output is:

freq_item{1,1}=[3     1
                3     0
                3     1
                3     1]

freq_item{1,2}=[1     0           
                1     1
                1     1]

freq_item{1,3}=[2     1
                2     0
                2     1
                2     0]

freq_item{1,4}=[4     0  
                4     0         
                4     1
                4     0]

But this code requires having the overhead of introducing the intermediate matrix B. Is there a code that can do this without the need for the intermediate matrix B and is at least as fast as the aforementioned code (i.e., its time complexity is less than or equal to that of the code written above)?

@StewieGriffin Yes, very large matrices. That's why it'd be good to be able to do away with the extra variable, if possible, and have the CPU bear the burden of executing a complicated one-liner code, instead of the hard disk bearing the burden of storing another matrix sized about half that of the original one. — Kristada673, Sep 20 '16 at 07:19
I don't know what you consider as the "extended" version, but none of the answers below deals with the options of ties as mentioned in the question: "choose the one with the bigger ratio of 1's". — EBH, Sep 20 '16 at 20:16
@EBH Yeah, I noticed. But I've tested and found that all the codes, including the one in the question, just take care of that point by default. Maybe it's the way these functions are implemented in MATLAB. — Kristada673, Sep 20 '16 at 20:23
That's not true. At least not for the answers. Try switching all `1`'s and `2`'s in the ID column in `A`, so now `2` should come before `1`, and see if it happens. — EBH, Sep 20 '16 at 20:32
Hmmm, ok, I'll try it out in some time. I checked them only on this random, made up matrix A, and it seemed fine. — Kristada673, Sep 20 '16 at 20:35

Robert Seifert · Accepted Answer · 2016-09-20T21:57:58.077

5

Just another job for accumarray:

%// prep
subs = A(:,1);
vals = A(:,2);

%// find id with amximum occurences
[~, id] = max( accumarray(subs,vals) )

%// find indices of that id
idx = find(A == id)

%// filter output
out = A(idx,:)

or shorter

[~, id] = max( accumarray(A(:,1),A(:,2)) )
out = A(find(A == id),:)

Extended version

%// prep
subs = A(:,1);
vals = A(:,2);

%// find id with maximum occurences and mean values
sums = accumarray(subs,vals)
ratios = accumarray(subs,vals,[],@mean)
rows = 1:numel(sums)

%// distributing 
unsorted = accumarray(subs,1:numel(subs),[],@(x) {A(x,:)} )

%// sorting indices
[~,idx] = sortrows([sums(:),ratios(:),rows(:)],[-1 -2 3])
sorted = unsorted(idx)

sorted{1,2} =

     3     0
     3     1
     3     1
     3     1

sorted{2,2} =

     1     0
     1     1
     1     1

sorted{3,2} =

     2     0
     2     0
     2     1
     2     1

sorted{4,2} =

     4     1
     4     0
     4     0
     4     0

edited Sep 20 '16 at 21:57

answered Sep 20 '16 at 07:06

Robert Seifert

25,078
11
68
113

But your code uses far more extra variables than the code in question. – Kristada673 Sep 20 '16 at 07:10
@Kristada673 so put them together, have a look at my edit. What's the problem of intermediate variables? – Robert Seifert Sep 20 '16 at 07:13
Nothing. Its just that in all my previous questions where you answered with `accumarray`, there was a real need for it because my code used looping, and `accumarray` provided elegant, fast solutions. But here, `accumarray` and the code in question both perform exactly the same. The need for an extra variable is not removed with `accumarray`. I just wanted to know if there can be a one-liner to do this without introducing new variables. :-) – Kristada673 Sep 20 '16 at 07:17
@Kristada673 if you consider that `mode` is not a Matlab base function, and you need to pay extra for it, I'd use accumarray anyway ;) So there is need for accumarray ;) – Robert Seifert Sep 20 '16 at 07:48
I have edited the question to make it a bit more interesting, just for fun ;) See if your `accumarray` can work better now :) – Kristada673 Sep 20 '16 at 08:19
1

@Kristada673 have a look at my edit, nothing accumarray can't do – Robert Seifert Sep 20 '16 at 09:58
I have marked your solution as the answer, as `accumarray` does away with the need for the `for` loop. However, in the resultant `sorted` matrix, the ordering of the elements is lost. If that could be recovered, it'd be a perfect solution. – Kristada673 Sep 20 '16 at 14:01
@Kristada673 The reason is, that `accumarray` is not "stable", to fix this please apply [**this answer**](http://stackoverflow.com/questions/28463433/stable-accumarray-in-matlab) – Robert Seifert Sep 20 '16 at 14:07
@Kristada673 if this solution was considering the ties, then the "stable" solution was not a problem. Have a look at my answer for an alternative. – EBH Sep 20 '16 at 21:46
@Kristada673 I have overseen the request mentioned by EBH, I adressed it in my last edit. – Robert Seifert Sep 20 '16 at 21:58

score 1 · Answer 2 · edited Sep 20 '16 at 14:30

1

First, find the most recurrent ID

mode(A(logical(A(:,2))))

This statement uses logical indexing to consider only the ones, as well as the mode function to return the most frequently occurring value.

You can also put it this way:

mode(A( A(:,2) == 1 ))

Then, extract the lines corresponding to this value

A( A(:,1) == mode(A( A(:,2) == 1 )),:)

Here we compare the first column (A(:,1)) to the most frequent ID, which returns a boolean vector. Logical indexing is then used to extract every matching line, and the corresponding columns.

Answer to the extended version of the question

[B,IX] = sort(histc(A( A(:,2) == 1 ),unique(A(:,1))),'descend');
freq={}; 
for a=IX'; 
   freq{end+1}=A(A(:,1) == subsref(unique(A(:,1)),struct('type','()','subs',{{a}})),:); 
end

edited Sep 20 '16 at 14:30

Kristada673

3,512
6
39
93

answered Sep 20 '16 at 07:43

MayeulC

1,628
17
24

Your code doesn't give all the records of the most frequent ID. It outputs only 1 record. – Kristada673 Sep 20 '16 at 07:47
1

Ok, I didn't understand your question, then. How about `A(A(:,1) == mode(A( A(:,2) == 1 )) ,:)` ? – MayeulC Sep 20 '16 at 07:50
Ok, that works! You can edit your answer and put this code there so that I can mark it as the answer to this question :) – Kristada673 Sep 20 '16 at 07:52
Done. I added some explanations in case someone wonders how it works. – MayeulC Sep 20 '16 at 07:59
Actually, I have edited the question to make it a bit more interesting, just for fun ;) See if you can do without introducing a new variable now. – Kristada673 Sep 20 '16 at 08:18
1

I cheated a bit, since it was getting *really* complicated. I found multiple solutions, but how about something like `[B,IX] = sort(histc(A( A(:,2) == 1 ),unique(A(:,1))),'descend'); freq={}; for a=IX'; freq{end+1}=A(A(:,1) == subsref(unique(A(:,1)),struct('type','()','subs',{{a}})),:); end` That said, I tried to refrain from using `accumarray` this time :) – MayeulC Sep 20 '16 at 10:09
Cool solution! I have taken the liberty to edit your answer to reflect your solution to the extended version of the question. However, it doesn't get rid of the `for` loop, and is much more complicated than the solution in the question to understand. So, it kind of defeats the purpose, don't you think? :) – Kristada673 Sep 20 '16 at 13:56

score 1 · Answer 3 · edited May 23 '17 at 12:08

Here is another option, using histcounts mostly. It's hard to compare this to @MayeulC solution, since it do does some more things. As for @thewaywewalk answer, in general if A is big then histcounts might be a better choice than accumarray. Here, however, this answer (with the for loop) is always faster (and it's getting better as A is bigger).

This is the non-intermediate-variables-very-slow-and-ugly version:

output = arrayfun(@(k) A(A(:,1)==k,:),...
    subsref(sortrows([...
    histcounts(A(logical(A(:,2)),1),min(A(:,1)):max(A(:,1))+1);
    histcounts(A(logical(A(:,2)),1),min(A(:,1)):max(A(:,1))+1)./...
    histcounts(A(:,1),min(A(:,1)):max(A(:,1))+1);
    min(A(:,1)):max(A(:,1))].',[-1 -2]),struct('type','()','subs',{{...
    subsref(sortrows([...
    histcounts(A(logical(A(:,2)),1),min(A(:,1)):max(A(:,1))+1);
    histcounts(A(logical(A(:,2)),1),min(A(:,1)):max(A(:,1))+1)./...
    histcounts(A(:,1),min(A(:,1)):max(A(:,1))+1);
    min(A(:,1)):max(A(:,1))].',[-1 -2]),struct('type','()','subs',{{':',2}}))>0,3}}))...
    ,'UniformOutput',false);

And this is the much more readable and faster version:

ID_range = (min(A(:,1)):max(A(:,1))+1);     % the IDs
one_count = histcounts(A(logical(A(:,2)),1),ID_range); % the count of 1's
zero_count = histcounts(A(:,1),ID_range);    % the count of 0's
sortedIDs = sortrows([one_count; one_count./zero_count; ID_range(1:end-1)].',[-1 -2]);
output = cell(numel(nonzeros(sortedIDs(:,1))),1);
for k = 1:numel(output)
    output{k} = A(A(:,1)==sortedIDs(k,3),:);
end

To test this you need another matrix, that it's ID are not already sorted this way:

and we get the output:

output{1} =
     3     1
     3     0
     3     1
     3     1
output{2} =
     7     1
     7     1
     7     0
output{3} =
     5     0
     5     1
     5     0
     5     1
output{4} =
     4     0
     4     0
     4     1
     4     0

nice catch, I edited my answer to include your hint. – Robert Seifert Sep 20 '16 at 22:02 — Robert Seifert, Sep 20 '16 at 22:02

Fast way to extract the frequency of elements in a vector grouped by indices in another vector?

3 Answers3

Extended version

Answer to the extended version of the question