
I'm looking for suggestions on how to solve the following problem elegantly. Although performance isn't an issue in my specific case, I'd appreciate comments regarding good practices.

Thanks in advance!

The short version:

I'm trying to average matrix rows according to some logic, while ignoring NaN values. The code I currently have does not handle NaN values the way I want.

The long version:

My data is built in the following manner:

  • A single (first) column of "bins". The number of rows per bin is not constant. The bins don't have to be integers. Rows are pre-sorted.
  • A variable number of data columns, possibly including NaNs.

Here's an example:

DATA = [...
180     NaN     NaN     1.733
180     NaN     NaN     1.703
200     0.720   2.117   1.738
200     0.706   2.073   1.722
200     0.693   2.025   1.723
200     NaN     NaN     1.729
210     NaN     NaN     1.820
210     NaN     NaN     1.813
210     NaN     NaN     1.805
240     NaN     NaN     1.951
240     NaN     NaN     1.946
240     NaN     NaN     1.946
270     NaN     NaN     2.061
270     NaN     NaN     2.052
300     0.754   2.356   2.103
300     0.758   2.342   2.057
300     NaN     NaN     2.066
300     NaN     NaN     2.066 ];

The desired result is a matrix that contains the unique "bins" in the first column, and means "unspoiled by NaNs" in the rest. That is:

  • If for a specific column+bin, there are only NaNs (in the above example: 1st data column+bin 210) - the result would be NaN.
  • If for a specific column+bin there is a mix of NaNs and numbers, the result would be the mean of the valid numbers. In the above example: 1st data column+bin 200 should give (0.720+0.706+0.693)/3=0.7063 -- note the division by 3 (and not 4) for this column+bin.

Here's the desired result for the above example:

RES = [...
180     NaN     NaN     1.718
200     0.7063  2.072   1.728
210     NaN     NaN     1.812
240     NaN     NaN     1.948
270     NaN     NaN     2.056
300     0.756   2.349   2.074 ];

What I tried so far:

This is some code I managed to piece together from several sources. It works correctly for column+bin combinations that contain only numbers or only NaNs; mixed combinations incorrectly come out as NaN.

nDataCols = size(DATA,2)-1;
[u,m,n] = unique(DATA(:,1)); %u = unique bins, n = bin index of every row
sz = size(m);
N = accumarray(n,1,sz); %total row count per bin (NaN rows included)

RES(length(u),nDataCols) = 0; %Preallocation

for ind1 = 1:nDataCols
    RES(:,ind1) = accumarray(n,DATA(:,ind1+1),sz)./N; %per-bin sum divided by count
end

RES = [u,RES];
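
(The reason, as far as I can tell: accumarray accumulates with @sum by default, and sum propagates NaN, so any column+bin group containing even one NaN sums to NaN before the division by N. A quick check:)

sum([0.720; 0.706; 0.693; NaN]) %returns NaN, so NaN./N stays NaN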

Here's what I'm currently getting:

RES = [...
180     NaN     NaN     1.718
200     NaN     NaN     1.728
210     NaN     NaN     1.812
240     NaN     NaN     1.948
270     NaN     NaN     2.056
300     NaN     NaN     2.074 ];

p.s.

  1. If by any chance this is easier to do using spreadsheet software (such as MS Excel) - I'd love to hear ideas.
  2. Doing the computation on a per-column basis is my current idea of how to handle this. I was just wondering whether there's a way to generalize it to take the complete matrix in one go.
Dev-iL
    +1 for a clearly presented question. If only new users created questions like these :) – Amro Jul 13 '14 at 15:38
  • I agree. Very good question. I was working on a solution, only to be beaten by Luis Mendo again :P +1 anyway. – rayryeng Jul 13 '14 at 15:42
  • @rayryeng Sorry about that :-) – Luis Mendo Jul 13 '14 at 15:49
  • @Amro & rayryeng - thanks guys. It's because I don't think that tormenting potential readers with ill-posed questions is a good way of getting answers :) and rayryeng - if you have another solution I won't mind hearing it! – Dev-iL Jul 13 '14 at 16:00
  • This question should be a model on how to ask: clear statement, attempts, runnable code – Luis Mendo Jul 13 '14 at 16:11
  • @Luis - Let me propose that on meta real quick... :D – Dev-iL Jul 13 '14 at 16:13
  • @Dev-iL You'd definitely get my vote! :-) – Luis Mendo Jul 13 '14 at 16:19
  • Definitely agree. I'll vote on meta. @Dev-iL - The solution I originally had is more inefficient than the one Luis Mendo suggested. Let's just use his solution :) – rayryeng Jul 13 '14 at 16:45
  • @Dev-iL - I'll put mine up, but it's pretty inefficient in comparison to Luis Mendo's. Let's put it up for academic purposes then – rayryeng Jul 13 '14 at 17:18
  • Gentlemen, please see the following meta discussion: http://meta.stackoverflow.com/questions/265556/listing-exemplary-questions-on-the-help-center-asking-page – Dev-iL Jul 13 '14 at 17:33
  • @Dev-iL - Here is a good question that was asked with a good answer - http://stackoverflow.com/questions/24269516/how-i-obtain-bars-with-function-bar3-and-different-widths-for-each-bar. One of my favourites and answered by our resident expert - Luis Mendo – rayryeng Jul 13 '14 at 17:42

2 Answers


One possible approach: find changes in the first column (exploiting the fact that it's pre-sorted) and apply nanmean to each block of rows. The first line below collects the starting row of each block, plus one final index just past the last row (for the example DATA, ind = [1; 3; 7; 10; 13; 15; 19]):

ind = find(diff([-inf; (DATA(:,1)); inf])~=0); %// value changed: start of block
r = arrayfun(@(n) nanmean(DATA(ind(n):ind(n+1)-1,:)), 1:numel(ind)-1, 'uni', 0);
RES = vertcat(r{:});

You can replace arrayfun by an explicit loop. That may be faster, and avoids the overhead introduced by cells:

ind = find(diff([-inf; (DATA(:,1)); inf])~=0); %// value changed: start of block
RES = zeros(numel(ind)-1, size(DATA,2)); %// preallocate
for n = 1:numel(ind)-1 %// loop over blocks
    RES(n,:) = nanmean(DATA(ind(n):ind(n+1)-1,:));
end

Your approach can be used as well. You only need to call accumarray with a handle to the nanmean function. This doesn't require the first column to be pre-sorted.

nDataCols = size(DATA,2)-1;
[u, ~, n] = unique(DATA(:,1));
RES = zeros(length(u), nDataCols); %// Preallocation
for ind1 = 1:nDataCols
    RES(:,ind1) = accumarray(n, DATA(:,ind1+1), [], @nanmean);
end
RES = [u, RES];
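
As noted in the comments below, nanmean is part of the Statistics Toolbox. If you don't have it, a minimal substitute is easy to write; here is one sketch (the name localNanMean and the pinning to dimension 1 are my own choices, not part of the toolbox function):

function m = localNanMean(x)
%// Column-wise mean ignoring NaNs: zero them out for the sum,
%// then divide by the per-column count of non-NaN entries
valid = ~isnan(x);
x(~valid) = 0;
m = sum(x, 1) ./ sum(valid, 1); %// 0/0 yields NaN for all-NaN groups
end

Pinning the dimension to 1 also avoids the edge case of a one-row block, where nanmean would otherwise average along dimension 2. On R2015a and newer, mean(x, 1, 'omitnan') does the same job without any toolbox.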
Luis Mendo
  • +1 - Very nice. Didn't know about `nanmean`. Also cool way to find transitions. Another one for my book. – rayryeng Jul 13 '14 at 15:44
  • @Luis - I like the essence of your 1st solution (the fact that it's loop-less), yet going through cells seemingly creates extra overhead and therefore feels like a not-too-extensible idea performance-wise... Would you say that my worries are unfounded? – Dev-iL Jul 13 '14 at 15:51
  • @Dev-iL I think they are well-founded: cells introduce overhead. Besides, there _is_ a loop in my first approach: `arrayfun` is a loop. In fact, it may be [slower than a loop](http://stackoverflow.com/questions/12522888/arrayfun-can-be-significantly-slower-than-an-explicit-loop-in-matlab-why). Your approach is good, and the loop is over number of columns, whereas my loop is over number of blocks. Depending on how many blocks or columns you have, you may prefer one or the other – Luis Mendo Jul 13 '14 at 15:53
  • @Dev-iL You could replace `arrayfun` by a loop over blocks. That way you avoid cells – Luis Mendo Jul 13 '14 at 15:54
  • Alright then. Thanks for your answer! I'll study this 1st solution you suggested - it looks intriguing. p.s. you can edit your post to disregard the `m` variable since it's not used: `[u,~,n] = unique(DATA(:,1));` ;) – Dev-iL Jul 13 '14 at 15:55
  • +1 I actually prefer the second solution. Just a note, `nanmean` is part of the Statistics toolbox, but it's easy to write your own version if you don't have it. – Amro Jul 13 '14 at 15:59
  • @Dev-iL Thanks for the tip. I've updated the answer with the explicit loop instead of `arrayfun` – Luis Mendo Jul 13 '14 at 15:59
  • @rayryeng You need to explicitly change `NaN`'s into `0`'s before applying `sum` (as the `nanmean` code does), otherwise `sum` returns `NaN` – Luis Mendo Jul 13 '14 at 16:07
  • @LuisMendo - I know, which is why I deleted my answer. I thought I'd be able to tackle it in two lines, but doing that 0-setting requires a bit more, so I'm just going to can my answer. Thanks though :) – rayryeng Jul 13 '14 at 16:08
  • @Luis - I revisited this question and noticed something I want to direct your attention to: the difference in preallocation methods that you and I use - you use `zeros` while I assign a zero in the corner. || Though strange, the reason I do this is rooted in a seminar by [Yair Altman](http://stackoverflow.com/users/233829/yair-altman) I once heard, which is summarized [in this post](http://undocumentedmatlab.com/blog/preallocation-performance#variants). Long story short - there is (correct to 2012) a large performance impact from using `zeros`. (Disclosure: I did not benchmark this personally) – Dev-iL Aug 31 '14 at 18:43
  • Just thought I'd mention that as of R2015a, `mean(...,'omitnan')` is part of standard MATLAB, in case the stats toolbox is not available. – zeeMonkeez Apr 07 '16 at 02:28

Here's another solution, though a grossly inefficient one. Also, the output array replaces all of the NaN values with 0. Let's just say this is good for academic study. Here are the steps I took:

  1. Find the list of unique IDs in the first column.
  2. Split each of the remaining columns into its own cell of a cell array.
  3. Create a new cell array where the first (ID) column is prepended to each element of this cell array.
  4. Filter out those rows of each cell's array that contain a NaN value.
  5. For each cell of the filtered result, run accumarray with mean as the function handle.
  6. Using the IDs from Step #1, index each accumarray result and transform everything back into a matrix.

%// Step #1
num = unique(DATA(:,1));

%// Step #2
cells = mat2cell(DATA, size(DATA,1), ones(size(DATA,2),1));

%// Step #3
cellsAppend = cellfun(@(x) [DATA(:,1) x], cells(2:end), 'uni', false);

%// Step #4
cellsNonNaN = cellfun(@(x) x(~isnan(x(:,2)),:), cellsAppend , 'uni', false);

%// Step #5
cellsMean = cellfun(@(x) accumarray(x(:,1), x(:,2), [], @mean), cellsNonNaN, 'uni', false);

%// Step #6
selectCells = cellfun(@(x) x(num), cellsMean, 'uni', false);
RES = [num cell2mat(selectCells)];

The result is:

RES = 

180.0000         0         0    1.7180
200.0000    0.7063    2.0717    1.7280
210.0000         0         0    1.8127
240.0000         0         0    1.9477
270.0000         0         0    2.0565
300.0000    0.7560    2.3490    2.0730

As you can see, pretty inefficient - especially with the number of cellfun calls I made, but still an academic example I guess!
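
One caveat worth flagging on top of the inefficiency: accumarray requires its subscript argument to be positive integers, so feeding it the raw bin values in Step #5 only works because the example bins happen to be positive integers. Since the question notes that the bins don't have to be integers, a general version would remap them first, along these lines (a sketch, not part of the steps above):

[num, ~, idx] = unique(DATA(:,1)); %// idx maps each raw bin value to 1..numel(num)
%// ...then carry idx instead of DATA(:,1) through Steps #3-#5, and index with 1:numel(num) in Step #6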

rayryeng
  • You should not be discouraged by the fact that your solution to a specific problem seems inefficient to you... Sometimes *effectiveness* is more important than efficiency. Generally speaking, since you took the time to answer and saw what was suggested by others - you benefited directly, and so will others who happen to stop by this question in the future :-) – Dev-iL Jul 13 '14 at 17:59