2

I have a 352x11 matrix, indexed by column 1 with 10 data points. Some of the index values are repeated. I'd like to find the repeated indices and calculate the mean data points for the repeated trials (avoiding loops, if possible).

For example,

x =

   26   77.5700   17.9735   32.7200
   27   40.5887   16.6100   31.5800
   28   60.4734   18.5397   33.6200
   28   35.6484   27.2000   54.8000
   29   95.3448   19.0000   37.7300
   30   82.7273   30.4394   39.1400

to end up with:

ans =

   26   77.5700   17.9735   32.7200
   27   40.5887   16.6100   31.5800
   28   48.0609   22.8699   44.2150
   29   95.3448   19.0000   37.7300
   30   82.7273   30.4394   39.1400

I was thinking if I used

J = find(diff(x(:,1))==0);

to find the position of the repeated values, I could then apply the function to the corresponding positions of x, but where do I begin?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
8eastFromThe3ast
  • 197
  • 1
  • 13

4 Answers4

6

A more general approach would employ unique to find the unique index values:

[U, ix, iu] = unique(x(:, 1));

and then accumarray:

[c, r] = meshgrid(1:size(x, 2), iu);
y = accumarray([r(:), c(:)], x(:), [], @mean);

Explanation

The input values to process are actually the second parameter of accumarray.

The first parameter of accumarray is a matrix, each row being a set of indices in the (accumulated) output matrix, and it corresponds to a value from the matching row in the vector given as the second parameter.

Think of the output as a cell array. The second parameters are the input values, and each row in the first parameter tells in which cell of the output matrix accumarray should store the corresponding input value. When output "cell array" is finished, a function (mean in our case) is applied to each cell.

Example

Here's a short example with a smaller matrix:

x = [27, 10, 8;
     28, 20, 10;
     28, 30, 50];

We find the unique values by:

[U, ix, iu] = unique(x(:, 1));

Vector U stores the unique values, and iu indicates which index of the value associated with each row (note that in this solution we have no use for ix ). In our case we get that:

U = 
    27
    28

iu =
    1
    2
    2

Now we apply accumarray:

[c, r] = meshgrid(1:size(x, 2), iu);
y = accumarray([r(:), c(:)], x(:), [], @mean);

The fancy trick with meshgrid and [r(:), c(:)] produces a set of indices:

[r(:), c(:)] =
     1     1
     2     1
     2     1
     1     2
     2     2
     2     2
     1     3
     2     3
     2     3

and these are the indices for the input values x(:), which is a column-vector equivalent of x:

x(:) =
    27
    28
    28
    10
    20
    30
     8
    10
    50

The process of accumulation:

  • The first value 27 goes to cell <1,1> in the output matrix.
  • The second value 28 goes to cell <2,1> in the output matrix.
  • The third value 28 goes to cell <2,1> in the output matrix.

See what just happened? Both values 28 get accumulated in the same cell (and eventually they will be averaged). The process continues:

  • The fourth value 10 goes to cell <1,2> in the output matrix.

and so on...

Once all values are stored in cells, the function mean is applied on each cell and we get the final output matrix:

y =
    27    10     8
    28    25    30
Dan
  • 45,079
  • 17
  • 88
  • 157
Eitan T
  • 32,660
  • 14
  • 72
  • 109
  • @8eastFromThe3ast I've added an explanation for you. – Eitan T Apr 18 '13 at 16:02
  • 1
    @8eastFromThe3ast This is the more reliable solution – Dan Apr 18 '13 at 16:07
  • 1
    @Eitan T Gosh that is detailed! Thanks very much. For the moment, Dan's kludge works very well, but when I have a bit more time i'll work my way through your suggestion and see which one works better! Thanks again – 8eastFromThe3ast Apr 18 '13 at 16:12
  • @8eastFromThe3ast No problem. The purpose is to learn, and we're all here to help each other. – Eitan T Apr 18 '13 at 16:13
  • 1
    Dan's and Eitan's solutions are basically the same, the only difference is in how the subs (labels) matrix is created. Eitan's solution is cleaner in my opinion and I personally prefer meshgrid, or even a couple of calls to repmat. Kron used to be slower in the past. – Oleg Apr 18 '13 at 16:17
  • I feel obliged to go with Eitan's, purely because of the explanation! It actually turns out that (for my data at least), Dan's solution is quicker (0.0781s vs 0.1094), but i'm guessing that might change with more data? – 8eastFromThe3ast Apr 18 '13 at 16:26
  • @8eastFromThe3ast Let Dan keep the checkmark, his solution _does_ work so he earned it. The bottom line is that you're encouraged to go with the solution that performs best for you. – Eitan T Apr 18 '13 at 16:36
  • @EitanT @Dan Because both of your solutions work so well, I'm thinking of applying the same concept to another calculation I need to do. Is it possible to index the data by two columns simultaneously, such that `accumarray` applies the `mean` function to each duplicate index, but only when the values in another column are the same? Would it be better if I asked a new question to that effect? – 8eastFromThe3ast Apr 19 '13 at 09:30
  • @8eastFromThe3ast a new question is preferred, but I strongly suggest that you give it a shot yourself first. It's all a matter of constructing the SUBS and VALS parameters of `accumarray` accordingly. – Eitan T Apr 19 '13 at 13:00
4

You can apply accumarray to multiple columns as shown here

labels = x(:,1) - min(x(:, 1)) + 1; 
labels = [repmat(labels(:),size(x,2),1), kron(1:size(x,2),ones(1,numel(labels))).'];             
totals = accumarray(labels,x(:),[], @mean);

This is adapted from Gnovice's code.

To get it to work for your code you then need to delete all the zeros in the front

totals(find(mean((totals == zeros(size(totals)))')), :) = [];

which results in the desired

   26.0000   77.5700   17.9735   32.7200
   27.0000   40.5887   16.6100   31.5800
   28.0000   48.0609   22.8699   44.2100
   29.0000   95.3448   19.0000   37.7300
   30.0000   82.7273   30.4394   39.1400
Community
  • 1
  • 1
Dan
  • 45,079
  • 17
  • 88
  • 157
  • It appears you've done it again, that's just the ticket! Thanks very much, I've been banging my head against the wall all day trying to figure something out. – 8eastFromThe3ast Apr 18 '13 at 15:56
  • This solution is not robust though (mostly because I don't understand the workings of it) - the fix at the end to get rid of the zeros is not a good sign. Like I'm not sure what would happen if there are negatives in the first column... it still needs work but I think it works for now. – Dan Apr 18 '13 at 15:57
  • `labels = x(:,1);` this line needs improving, unless x(:,1) only contains positive integers that are always increasing. The always increasing is easily solved with `sortrows` and the positive is easily solved with `labels = x(:,1) - min(x(:, 1)) + 1` which I think also fixes the zero padding issue – Dan Apr 18 '13 at 15:59
  • Yes, the first column is an index of participant number, so there are no negative values! It does the job perfectly for now, so i'll go with it. – 8eastFromThe3ast Apr 18 '13 at 16:08
  • I think I've fixed it now, I changed the first and last lines of the code – Dan Apr 18 '13 at 16:11
  • @Dan I guess our solutions are similar, the only thing that is different is the way we're creating the matrix of subscript indices. – Eitan T Apr 18 '13 at 16:16
  • @EitanT I've got to be honest, I may have made it work but I don't really understand it. Your solution doesn't require the removing zeros hack at the end though and comes with an explanation so seems like a better choice to me – Dan Apr 18 '13 at 16:26
0

You might find accumarray with @mean useful:

Assuming first column holds values 1 .. k for some k <= size(x,1), you may compute each column of the output using

col = accumarray( x(:,1), x(:,2), [], @mean ); % second column
Shai
  • 111,146
  • 38
  • 238
  • 371
  • I've tried using `accumarray`, but because i'd like to apply the function to multiple columns simultaneously, my `VAL` values are incompatible with the function. – 8eastFromThe3ast Apr 18 '13 at 15:36
  • @8eastFromThe3ast but you could just loop over the columns now – Dan Apr 18 '13 at 15:55
0

Given you input

x = [ ...
    26   77.5700   17.9735   32.7200; ...
    27   40.5887   16.6100   31.5800; ...
    28   60.4734   18.5397   33.6200; ...
    28   35.6484   27.2000   54.8000; ...
    29   95.3448   19.0000   37.7300; ...
    30   82.7273   30.4394   39.1400];

You can create an array of indexes where duplicated vgalues share the same index, using the third output of unique.

%Get index of unique values (1 - N)
[~, ~, ix] = unique(x(:,1))

Then you can use this array to rebuild your matrix, combining duplicated values with the function of your choice.

%Use accumarry to rebuild the matrix one column at a time
result = [...
    accumarray( ix, x(:,1), [], @max )  ...  %Many functions works here, as all inputs are the same.  E.G.  @mean, @max, @min
    accumarray( ix, x(:,2), [], @mean ) ...  %Use mean to combine data, per problem statement.
    accumarray( ix, x(:,3), [], @mean ) ...
    accumarray( ix, x(:,4), [], @mean ) ...
    ]
Shai
  • 111,146
  • 38
  • 238
  • 371
Pursuit
  • 12,285
  • 1
  • 25
  • 41
  • 1
    +1 for using `accumarray`, -1 for building columns one at a time :) you _can_ make `accumarray` build all columns at once if you build `ix` properly... – Eitan T Apr 18 '13 at 16:12
  • :) Point noted. I never use `accumaray` day-to-day; I'm still learning what it can do. – Pursuit Apr 18 '13 at 17:43