How to accumulate (average) data based on multiple criteria

Question

I have a set of data where I have recorded values in sets of 3 readings (so as to be able to obtain a general idea of the SEM). I have them recorded in a list that looks as follows, which I am trying to collapse into averages of each set of 3 points:

I want to collapse essentially each 3 rows into one row where the average data value is given for that set. In essence, it would look as follows:

This is something I know how to do basically in Excel (i.e. using a Pivot table) but I am not sure how to do the same in MATLAB. I have tried using accumarray but struggle with knowing how to incorporate multiple conditions essentially. I would need to create a subs array where its number corresponds to each unique set of 3 data points. By brute force, I could create an array such as:

subs = [1 1 1; 2 2 2; 3 3 3; 4 4 4; ...]'

using some looping and have that as my subs array, but since it isn't tied to the data itself, and there may be strange hiccups throughout (i.e. more than 3 data points per set, or missing data, etc.). I know there must be some way to have this sort of Pivot-table-esque grouping for something like this, but need some help to get it off the ground. Thanks.

Here is the input data in text form:

Subject  Flow   On/Off   Values
1        10     1        2.20
1        10     1        2.50
1        10     1        2.60
1        20     1        5.50
1        20     1        6.10
1        20     1        5.90
1        30     1        10.10
1        30     1        10.50
1        30     1        10.50
1        10     0        1.90
1        10     0        2.20
1        10     0        2.30
1        20     0        5.20
1        20     0        5.80
1        20     0        5.60
1        30     0        9.80
1        30     0        10.20
1        30     0        10.20
2        10     1        5.70
2        10     1        6.00
2        10     1        6.10
2        20     1        9.00
2        20     1        9.60
2        20     1        9.40
2        30     1        13.60
2        30     1        14.00
2        30     1        14.00
2        10     0        5.40
2        10     0        5.70
2        10     0        5.80
2        20     0        8.70
2        20     0        9.30
2        20     0        9.10
2        30     0        13.30
2        30     0        13.70
2        30     0        13.70

Can you please paste the input data as text too? Also, how do you ger for example output `2.13`? What numbers are averaged there? — Luis Mendo, May 16 '17 at 21:28
Hi @LuisMendo, thanks for looking at this question. I uploaded the text data just now. — teepee, May 16 '17 at 21:35

score 4 · Answer 1 · answered May 16 '17 at 21:44

4

You can use unique and accumarray like so to maintain the order of your rows of data:

[newData, ~, subs] = unique(data(:, 1:3), 'rows', 'stable');
newData(:, 4) = accumarray(subs, data(:, 4), [], @mean);

newData =

    1.0000   10.0000    1.0000    2.4333
    1.0000   20.0000    1.0000    5.8333
    1.0000   30.0000    1.0000   10.3667
    1.0000   10.0000         0    2.1333
    1.0000   20.0000         0    5.5333
    1.0000   30.0000         0   10.0667
    2.0000   10.0000    1.0000    5.9333
    2.0000   20.0000    1.0000    9.3333
    2.0000   30.0000    1.0000   13.8667
    2.0000   10.0000         0    5.6333
    2.0000   20.0000         0    9.0333
    2.0000   30.0000         0   13.5667

answered May 16 '17 at 21:44

gnovice

125,304
15
256
359

Thank you @gnovice for your help; this is really great. I also wonder: do you know what could be done if one of the columns of data was of another data type (i.e. date cells) and thus couldn't be combined into the `data` array, such as in this example? – teepee May 17 '17 at 20:02
@teepee: If you have a date, you can convert it to a numeric value using [`datenum`](https://www.mathworks.com/help/matlab/ref/datenum.html). Then you can add it as another column in `data`. – gnovice May 17 '17 at 20:24
Ah, OK, that's the way to do it. Great, thanks very much. I have already implemented a workaround by compiling the data into a `table`, as `unique()` accepts this type and can spit out the `subs` I need. Is there any particular disadvantage to this approach as an alternative? – teepee May 17 '17 at 20:50
@teepee: No, tables are good ways to collect and display different types of data. – gnovice May 17 '17 at 20:56

Luis Mendo · Accepted Answer · 2017-05-16T22:16:27.703

3

I assume that

You want to average based on unique values of the first three columns (not on groups of three rows, although the two criteria coincide in your example);
Order is determined by column 1, then 3, then 2.

Then, denoting your data as x,

[~, ~, subs] = unique(x(:, [1 3 2]), 'rows', 'sorted');
result = accumarray(subs, x(:,end), [], @mean);

gives

As you see, I am using the third output of unique with the 'rows' and 'sorted' options. This creates the subs grouping vector based on first three columns of your data in the desired order. Then, passing that to accumarray computes the means.

edited May 16 '17 at 22:16

answered May 16 '17 at 21:36

Luis Mendo

110,752
13
76
147

Thank you for your answer! What would be the advantage of using the `'sorted'` type vs. using `'stable'`? – teepee May 16 '17 at 22:09
Without `'sorted'` you don't get the results in the same order you indicate in your question – Luis Mendo May 16 '17 at 22:15
1

Ah nice, thanks very much; very clear and concise way of doing it. Thanks! – teepee May 16 '17 at 22:20
I also wonder, @LuisMendo, were I to include another column--for example, date values--how could I do this same procedure, given that I cannot combine a column of dates into an array with the rest of the data? – teepee May 17 '17 at 20:00

score 0 · Answer 3 · answered May 16 '17 at 21:54

accumarray is indeed the way to go. First, you'll need assign an index to each set of values with unique :

[unique_subjects, ~, ind_subjects] = unique(vect_subjects);
[unique_flows, ~, ind_flows] = unique(vect_flows);
[unique_on_off, ~, ind_on_off] = unique(vect_on_off);

So basically, you now got ind_subjects, ind_flows and ind_on_off that are values in [1..2], [1..3] and [1..2].

Now, you can compute the mean values in a [3x2x2] array (in you example) :

mean_values = accumarray([ind_flows, ind_on_off, ind_subjects], vect_values, [], @mean);
mean_values = mean_values(:);

Nota : order is set accordingly to your example.

Then you can construct the summary :

[ind1, ind2, ind3] = ndgrid(1:numel(unique_flows), 1:numel(unique_on_off), 1:numel(unique_subjects));
flows_summary = unique_flows(ind1(:));
on_off_summary = unique_on_off(ind2(:));
subjects_summary = unique_subjects(ind3(:));

Nota : Also works with non numeric values.

score 0 · Answer 4 · answered May 17 '17 at 14:26

You should also try checking out the findgroups and splitapply reference pages. The easiest way to use them here is probably to place your data in a table:

 >> T = array2table(data, 'VariableNames', { 'Subject', 'Flow', 'On_Off', 'Values'});
 >> [gid,Tgrp] = findgroups(T(:,1:3));
 >> Tgrp.MeanValue = splitapply(@mean, T(:,4), gid)
 Tgrp =
   12×4 table
     Subject    Flow    On_Off    MeanValue
     _______    ____    ______    _________
     1          10      0         2.1333   
     1          10      1         2.4333   
     1          20      0         5.5333   
     1          20      1         5.8333   
     1          30      0         10.067   
     1          30      1         10.367   
     2          10      0         5.6333   
     2          10      1         5.9333   
     2          20      0         9.0333   
     2          20      1         9.3333   
     2          30      0         13.567   
     2          30      1         13.867

How to accumulate (average) data based on multiple criteria

4 Answers4