3

Say I have a matrix A whose first column contains some item IDs and second column contains either 0 or 1.

A=[3    1
   1    0
   4    0
   3    0
   1    1
   2    1
   3    1
   4    0
   2    0
   4    1
   3    1
   4    0
   2    1
   1    1
   2    0];

I want to find which item ID has the most 1's, and extract its entries from A, one by one. So, the way I do it is, I make a matrix B to extract all the 1 entries from A, find the most frequently occurring item ID, freq_item{1}, in B, and then extract all entries from A of that ID. Then, remove all instances of the most frequent item and search for the next most frequent item. If 2 or more items have the same number of 1's, choose the one with the bigger ratio of 1's:

B = A(A(:,2)==1,:);
for i=1:size(unique(A(:,1)),1)
   freq_item{i} = A(A(:,1)==mode(B(:,1)),:);
   B = B(B(:,1)~=mode(B(:,1)),:);
end

So, the output is:

freq_item{1,1}=[3     1
                3     0
                3     1
                3     1]

freq_item{1,2}=[1     0           
                1     1
                1     1]

freq_item{1,3}=[2     1
                2     0
                2     1
                2     0]

freq_item{1,4}=[4     0  
                4     0         
                4     1
                4     0]

But this code requires having the overhead of introducing the intermediate matrix B. Is there a code that can do this without the need for the intermediate matrix B and is at least as fast as the aforementioned code (i.e., its time complexity is less than or equal to that of the code written above)?

Kristada673
  • 3,512
  • 6
  • 39
  • 93
  • @StewieGriffin Yes, very large matrices. That's why it'd be good to be able to do away with the extra variable, if possible, and have the CPU bear the burden of executing a complicated one-liner code, instead of the hard disk bearing the burden of storing another matrix sized about half that of the original one. – Kristada673 Sep 20 '16 at 07:19
  • I don't know what you consider as the "extended" version, but none of the answers below deals with the options of ties as mentioned in the question: "choose the one with the bigger ratio of 1's". – EBH Sep 20 '16 at 20:16
  • @EBH Yeah, I noticed. But I've tested and found that all the codes, including the one in the question, just take care of that point by default. Maybe it's the way these functions are implemented in MATLAB. – Kristada673 Sep 20 '16 at 20:23
  • That's not true. At least not for the answers. Try switching all `1`'s and `2`'s in the ID column in `A`, so now `2` should come before `1`, and see if it happens. – EBH Sep 20 '16 at 20:32
  • Hmmm, ok, I'll try it out in some time. I checked them only on this random, made up matrix A, and it seemed fine. – Kristada673 Sep 20 '16 at 20:35

3 Answers3

5

Just another job for accumarray:

%// prep
subs = A(:,1);
vals = A(:,2);

%// find id with amximum occurences
[~, id] = max( accumarray(subs,vals) )

%// find indices of that id
idx = find(A == id)

%// filter output
out = A(idx,:)

or shorter

[~, id] = max( accumarray(A(:,1),A(:,2)) )
out = A(find(A == id),:)

Extended version

%// prep
subs = A(:,1);
vals = A(:,2);

%// find id with maximum occurences and mean values
sums = accumarray(subs,vals)
ratios = accumarray(subs,vals,[],@mean)
rows = 1:numel(sums)

%// distributing 
unsorted = accumarray(subs,1:numel(subs),[],@(x) {A(x,:)} )

%// sorting indices
[~,idx] = sortrows([sums(:),ratios(:),rows(:)],[-1 -2 3])
sorted = unsorted(idx)

sorted{1,2} =

     3     0
     3     1
     3     1
     3     1

sorted{2,2} =

     1     0
     1     1
     1     1

sorted{3,2} =

     2     0
     2     0
     2     1
     2     1

sorted{4,2} =

     4     1
     4     0
     4     0
     4     0
Robert Seifert
  • 25,078
  • 11
  • 68
  • 113
  • But your code uses far more extra variables than the code in question. – Kristada673 Sep 20 '16 at 07:10
  • @Kristada673 so put them together, have a look at my edit. What's the problem of intermediate variables? – Robert Seifert Sep 20 '16 at 07:13
  • Nothing. Its just that in all my previous questions where you answered with `accumarray`, there was a real need for it because my code used looping, and `accumarray` provided elegant, fast solutions. But here, `accumarray` and the code in question both perform exactly the same. The need for an extra variable is not removed with `accumarray`. I just wanted to know if there can be a one-liner to do this without introducing new variables. :-) – Kristada673 Sep 20 '16 at 07:17
  • @Kristada673 if you consider that `mode` is not a Matlab base function, and you need to pay extra for it, I'd use accumarray anyway ;) So there is need for accumarray ;) – Robert Seifert Sep 20 '16 at 07:48
  • I have edited the question to make it a bit more interesting, just for fun ;) See if your `accumarray` can work better now :) – Kristada673 Sep 20 '16 at 08:19
  • 1
    @Kristada673 have a look at my edit, nothing accumarray can't do – Robert Seifert Sep 20 '16 at 09:58
  • I have marked your solution as the answer, as `accumarray` does away with the need for the `for` loop. However, in the resultant `sorted` matrix, the ordering of the elements is lost. If that could be recovered, it'd be a perfect solution. – Kristada673 Sep 20 '16 at 14:01
  • @Kristada673 The reason is, that `accumarray` is not "stable", to fix this please apply [**this answer**](http://stackoverflow.com/questions/28463433/stable-accumarray-in-matlab) – Robert Seifert Sep 20 '16 at 14:07
  • @Kristada673 if this solution was considering the ties, then the "stable" solution was not a problem. Have a look at my answer for an alternative. – EBH Sep 20 '16 at 21:46
  • @Kristada673 I have overseen the request mentioned by EBH, I adressed it in my last edit. – Robert Seifert Sep 20 '16 at 21:58
1

First, find the most recurrent ID

mode(A(logical(A(:,2))))

This statement uses logical indexing to consider only the ones, as well as the mode function to return the most frequently occurring value.

You can also put it this way:

mode(A( A(:,2) == 1 ))

Then, extract the lines corresponding to this value

A( A(:,1) == mode(A( A(:,2) == 1 )),:)

Here we compare the first column (A(:,1)) to the most frequent ID, which returns a boolean vector. Logical indexing is then used to extract every matching line, and the corresponding columns.


Answer to the extended version of the question

[B,IX] = sort(histc(A( A(:,2) == 1 ),unique(A(:,1))),'descend');
freq={}; 
for a=IX'; 
   freq{end+1}=A(A(:,1) == subsref(unique(A(:,1)),struct('type','()','subs',{{a}})),:); 
end
Kristada673
  • 3,512
  • 6
  • 39
  • 93
MayeulC
  • 1,628
  • 17
  • 24
  • Your code doesn't give all the records of the most frequent ID. It outputs only 1 record. – Kristada673 Sep 20 '16 at 07:47
  • 1
    Ok, I didn't understand your question, then. How about `A(A(:,1) == mode(A( A(:,2) == 1 )) ,:)` ? – MayeulC Sep 20 '16 at 07:50
  • Ok, that works! You can edit your answer and put this code there so that I can mark it as the answer to this question :) – Kristada673 Sep 20 '16 at 07:52
  • Done. I added some explanations in case someone wonders how it works. – MayeulC Sep 20 '16 at 07:59
  • Actually, I have edited the question to make it a bit more interesting, just for fun ;) See if you can do without introducing a new variable now. – Kristada673 Sep 20 '16 at 08:18
  • 1
    I cheated a bit, since it was getting *really* complicated. I found multiple solutions, but how about something like `[B,IX] = sort(histc(A( A(:,2) == 1 ),unique(A(:,1))),'descend'); freq={}; for a=IX'; freq{end+1}=A(A(:,1) == subsref(unique(A(:,1)),struct('type','()','subs',{{a}})),:); end` That said, I tried to refrain from using `accumarray` this time :) – MayeulC Sep 20 '16 at 10:09
  • Cool solution! I have taken the liberty to edit your answer to reflect your solution to the extended version of the question. However, it doesn't get rid of the `for` loop, and is much more complicated than the solution in the question to understand. So, it kind of defeats the purpose, don't you think? :) – Kristada673 Sep 20 '16 at 13:56
1

Here is another option, using histcounts mostly. It's hard to compare this to @MayeulC solution, since it do does some more things. As for @thewaywewalk answer, in general if A is big then histcounts might be a better choice than accumarray. Here, however, this answer (with the for loop) is always faster (and it's getting better as A is bigger).

This is the non-intermediate-variables-very-slow-and-ugly version:

output = arrayfun(@(k) A(A(:,1)==k,:),...
    subsref(sortrows([...
    histcounts(A(logical(A(:,2)),1),min(A(:,1)):max(A(:,1))+1);
    histcounts(A(logical(A(:,2)),1),min(A(:,1)):max(A(:,1))+1)./...
    histcounts(A(:,1),min(A(:,1)):max(A(:,1))+1);
    min(A(:,1)):max(A(:,1))].',[-1 -2]),struct('type','()','subs',{{...
    subsref(sortrows([...
    histcounts(A(logical(A(:,2)),1),min(A(:,1)):max(A(:,1))+1);
    histcounts(A(logical(A(:,2)),1),min(A(:,1)):max(A(:,1))+1)./...
    histcounts(A(:,1),min(A(:,1)):max(A(:,1))+1);
    min(A(:,1)):max(A(:,1))].',[-1 -2]),struct('type','()','subs',{{':',2}}))>0,3}}))...
    ,'UniformOutput',false);

And this is the much more readable and faster version:

ID_range = (min(A(:,1)):max(A(:,1))+1);     % the IDs
one_count = histcounts(A(logical(A(:,2)),1),ID_range); % the count of 1's
zero_count = histcounts(A(:,1),ID_range);    % the count of 0's
sortedIDs = sortrows([one_count; one_count./zero_count; ID_range(1:end-1)].',[-1 -2]);
output = cell(numel(nonzeros(sortedIDs(:,1))),1);
for k = 1:numel(output)
    output{k} = A(A(:,1)==sortedIDs(k,3),:);
end

To test this you need another matrix, that it's ID are not already sorted this way:

A = [3 1;
    5 0;
    4 0;
    3 0;
    5 1;
    7 1; 
    3 1; 
    4 0;
    5 0; 
    4 1; 
    3 1;
    4 0; 
    7 1; 
    5 1; 
    7 0];

and we get the output:

output{1} =
     3     1
     3     0
     3     1
     3     1
output{2} =
     7     1
     7     1
     7     0
output{3} =
     5     0
     5     1
     5     0
     5     1
output{4} =
     4     0
     4     0
     4     1
     4     0
Community
  • 1
  • 1
EBH
  • 10,350
  • 3
  • 34
  • 59