
For a current project, I have to discretize quasi-continuous values into bins defined by some pre-defined binning resolution. For this purpose, I have written a function which I expected to be highly efficient, as it can process both scalar and vector inputs using bsxfun. However, after some profiling, I found that almost all of the processing time of my much larger project is spent in this function, and within the function it is mainly the bsxfun call that takes time, with the min query in second place. Long story short, I am looking for advice on how to solve this task MUCH faster in MATLAB. Side note: I usually pass vectors with some 50k elements.

Here's the code:

function sampleNo = value2sample(value,bins)

%Make sure both vectors have orientations fitting bsxfun
value = value(:);
bins = bins(:)';

%Recover bin resolution (avoids passing another parameter)
delta = median(diff(bins));

%Calculate distance matrix between all combinations
dist = abs(bsxfun(@minus,value,bins));

%What we really want to know is the minimum distance per row
[minval,ind] = min(dist,[],2);

%Make sure we don't accidentally further process NaNs as 1st bin
ind(isnan(minval))=NaN;

sampleNo = ind;
sampleNo(minval>delta) = NaN;

end
Michael
  • well, 50k elements may not seem much, but after `bsxfun`: 50000x50000 = 2.5 Billions elements to calculate. I wouldn't expect anything instantaneous. Have you tried running it in a loop to see the difference ? You may already be at the minimal execution time ... – Hoki Aug 16 '16 at 14:36
  • that said, if you only want the minimum per row, there might be some clever way to sort your input vectors and avoid calculating each of the 2.5 billions possibilities ... that won't be with `bsxfun` though. – Hoki Aug 16 '16 at 14:38
  • Maybe I should be more precise: typically, `bins` is about 3,000 elements long, and `value` somewhere between 50k and 60k. – Michael Aug 16 '16 at 14:49

2 Answers


The reason that your function is slow is because you are computing the distance between every element of values and bins and storing them all in an array - if there are N values and M bins then you will require NM elements to store all the distances, and this is probably a really big number (e.g. if each input has 50,000 elements then you need 2.5 billion elements in the output array).

Moreover, since your bins are sorted (you didn't state this, but it looks like you are assuming it in your code) you do not need to compute the distance from every value to every bin. You can be much smarter:

function ind = value2sample(value, bins)

    % Find median bin distance
    delta = median(diff(bins));

    % Bucket into 'nearest' bin by using midpoints
    bins = bins(:);
    mids = [-Inf; 0.5 * (bins(1:end-1) + bins(2:end))];

    [~, ind] = histc(value, mids);

    % Ensure that NaN values and points that aren't near any bin are returned as NaN
    ind(isnan(value)) = NaN;
    ind(abs(value - bins(ind)) > delta) = NaN;

end

In my tests, with values = randn(10000, 1) and bins = -50:50 it takes around 4.5 milliseconds to run the original function, and 485 microseconds to run the code above, so you are getting around a 10x speedup (and the speedup will be even greater as you increase the size of the inputs).
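The midpoint trick is not MATLAB-specific: bucketing by midpoints is equivalent to one binary search per value, which is why it beats the N-by-M distance matrix. Here is a minimal illustrative sketch in pure Python (function and variable names are my own, not from the answer above):

```python
from bisect import bisect_right

def value2sample(values, bins, delta):
    """Map each value to the 1-based index of its nearest bin (MATLAB-style),
    or None when the value is farther than `delta` from every bin.
    Assumes `bins` is sorted ascending."""
    # Midpoints between consecutive bins act as bucket edges:
    # everything between mids[i-1] and mids[i] is nearest to bins[i].
    mids = [(a + b) / 2 for a, b in zip(bins, bins[1:])]
    out = []
    for v in values:
        i = bisect_right(mids, v)  # 0-based index of the nearest bin
        out.append(i + 1 if abs(v - bins[i]) <= delta else None)
    return out
```

Each lookup is a single O(log M) binary search over the midpoints, so the total cost is O(N log M) instead of the O(NM) work (and memory) of the full distance matrix.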

Chris Taylor
  • Excellent inspiration, thank you. However, if `value` contains NaNs or values that exceed the maximum value of `bins` (which is indeed sorted, you are right), the function will throw an error in the last line before the `end`, as "subscript indices must either be real positive integers or logicals". – Michael Aug 16 '16 at 15:02
  • I tried to solve it by removing the respective elements, but this is not an option as the resulting ind-vector should have the same length as the input value vector. – Michael Aug 16 '16 at 15:08
  • @Michael the solution is to remove all NaN values from `value`, and any values larger than the largest bin, before calling the function. You can write a wrapper around the function which does this. Alternatively you can add an extra point `Inf` to the end of `mids` to capture elements that are larger than `max(bin)` and then throw out the ones that are more than `delta` away from `max(bin)`. – Chris Taylor Aug 16 '16 at 15:30

Thanks to @Chris Taylor, I was able to solve the problem very efficiently. The code now runs almost 400 times faster than before. The only changes I had to make from his version are reflected in the code below. The main change was to replace histc (whose use is no longer recommended) by discretize.

function ind = value2sample(value, bins)

% Make sure both are column vectors
value = value(:);
bins = bins(:);

% Bucket into 'nearest' bin by using midpoints
mids = [eps; 0.5 * (bins(1:end-1) + bins(2:end))];

ind = discretize(value, mids);

end

The only caveat is that in this implementation your bins must be non-negative (the first edge is `eps`). Other than that, this code does exactly what I want, including the fact that `ind` has the same size as `value` and contains NaNs whenever a value is NaN or out of the range of `bins`.
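For readers unfamiliar with `discretize`, its edge semantics (left-closed bins, the last bin closed on both ends, NaN for NaN or out-of-range inputs) can be mimicked in a few lines; this is an illustrative Python sketch of those semantics, not MATLAB's actual implementation:

```python
from bisect import bisect_right
import math

def discretize_like(values, edges):
    """Return a 1-based bin index for each value, or NaN when the value is
    NaN or falls outside [edges[0], edges[-1]], mirroring MATLAB's discretize.
    Assumes `edges` is sorted ascending."""
    out = []
    for v in values:
        if math.isnan(v) or v < edges[0] or v > edges[-1]:
            out.append(math.nan)
        else:
            # bisect_right gives left-closed bins; clamp so that
            # v == edges[-1] lands in the last bin (closed on both ends).
            out.append(min(bisect_right(edges, v), len(edges) - 1))
    return out
```

This NaN-for-out-of-range behavior is exactly what makes `discretize` handle the problematic inputs from the comments above without an explicit pre-filtering step.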

Michael
  • Have you tried using `histcounts`? It's the function that replaces `histc` and [it may be the fastest method](http://stackoverflow.com/q/38941694/2627163) – EBH Aug 17 '16 at 08:01
  • Thanks for the remark. However, both `histcounts` and `discretize` are successors to `histc`. Thus, I won't bother trying more options, given that I have already achieved a significant speed-up. :) – Michael Aug 17 '16 at 15:15
  • They are similar, but here's the difference according to [MATLAB docs](http://www.mathworks.com/help/matlab/ref/discretize.html?searchHighlight=discretize#moreabout): _"The behavior of discretize is similar to that of the histcounts function. Use histcounts to find the number of elements in each bin. On the other hand, use discretize to find which bin each element belongs to (without counting)."_ Anyway, I'm glad you found what works for you. – EBH Aug 17 '16 at 15:40