How can I vectorize code that runs a function on subsets of a larger matrix?

Question

Let's assume I have the following 9 x 5 matrix:

myArray = [
   54.7    8.1   81.7   55.0   22.5
   29.6   92.9   79.4   62.2   17.0
   74.4   77.5   64.4   58.7   22.7
   18.8   48.6   37.8   20.7   43.5
   68.6   43.5   81.1   30.1   31.1
   18.3   44.6   53.2   47.0   92.3
   36.8   30.6   35.0   23.0   43.0
   62.5   50.8   93.9   84.4   18.4
   78.0   51.0   87.5   19.4   90.4
];

I have 11 "subsets" of this matrix and I need to run a function (let's say max) on each of these subsets. The subsets can be identified with the following matirx of logicals (identified column-wise, not row-wise):

myLogicals = logical([
    0 1 0 1 1
    1 1 0 1 1
    1 1 0 0 0
    0 1 0 1 1
    1 0 1 1 1
    1 1 1 1 0
    0 1 1 0 1
    1 1 0 0 1
    1 1 0 0 1
]);

or via linear indexing:

starts = [2 5 8 10 15 23 28 31 37 40 43]; #%index start of each subset
ends =   [3 6 9 13 18 25 29 33 38 41 45]; #%index end of each subset

such that the first subset is 2:3, the second is 5:6, and so on.

I can find the max of each subset and store it in a vector as follows:

finalAnswers = NaN(11,1); 
for n=1:length(starts) #%i.e. 1 through the number of subsets
    finalAnswers(n) = max(myArray(starts(n):ends(n)));
end

After the loop runs, finalAnswers contains the maximum value of each of the data subsets:

74.4  68.6  78.0  92.9  51.0  81.1  62.2  47.0  22.5  43.5  90.4

Is it possible to obtain the same result without the use of a for loop? In other words, can this code be vectorized? Would such an approach be more efficient than the current one?

EDIT: I did some testing of the proposed solutions. The data I used was a 1,510 x 2,185 matrix with 10,103 subsets that varied in length from 2 to 916 with a standard deviation of subset length of 101.92.

I wrapped each solution in tic;for k=1:1000 [code here] end; toc; and here are the results:

for loop approach --- Elapsed time is 16.237400 seconds.
Shai's approach --- Elapsed time is 153.707076 seconds.
Dan's approach --- Elapsed time is 44.774121 seconds.
Divakar's approach #2 --- Elapsed time is 127.621515 seconds.

Notes:

I also tried benchmarking Dan's approach by wrapping the k=1:1000 for loop around just the accumarray line (since the rest could be theoretically run just once). In this case the time was 28.29 seconds.
Benchmarking Shai's approach, while leaving the lb = ... line out of the k loop, the time was 113.48 seconds.
When I ran Divakar's code, I got Non-singleton dimensions of the two input arrays must match each other. errors for the bsxfun lines. I "fixed" this by using conjugate transposition (the apostrophe operator ') on trade_starts(1:starts_extent) and intv(1:starts_extent) in the lines of code calling bsxfun. I'm not sure why this error was occuring...

I'm not sure if my benchmarking setup is correct, but it appears that the for loop actually runs the fastest in this case.

In your code, by any chance are you creating `myLogicals` from `starts` and `ends` or is it there just like `starts` and `ends` or maybe the other way around? — Divakar, Sep 29 '14 at 15:05
Also, do you have access to a GPU? Vectorization is suited a lot when ported to GPU with [**`gpuArrays`**](http://www.mathworks.in/help/distcomp/gpuarray.html). — Divakar, Sep 29 '14 at 18:31
@Divakar Question 1: I create `myLogicals` first and then get `starts` and `ends` from that. Question 2: I do have access to a relatively powerful GPU (680 GTX), but I am completely unfamiliar with `gpuArrays`. Do you happen to have a link to a good source online where I could learn more? This sounds really interesting — Alarik, Sep 30 '14 at 00:02
Let me ask you one thing - You said you are creating `"10,103 subsets that varied in length from 2 to 916 with a standard deviation of subset length of 101.92"`. Is that your actual data or you are just doing that to benchmark the solutions. Vectorization yields good result when you are working with large data chunk of homogeneous data, but with such a high value of standard deviation - `101.92`, it's just not suitable for vectorization techniques and as such for-loop would always be the best solution. — Divakar, Sep 30 '14 at 03:42
Regarding the error in my solution, you don't need to do any change in there, but the error could be because of the way you have your data setup. I was assuming `starts` and `ends` to have sorted data and there mustn't be any overlap between two subsets. Are these criteria maintained? — Divakar, Sep 30 '14 at 03:56
@Divakar Question 1: This matrix is a part of the actual data and very indicative of the sort of data/subset distributions I will be dealing with. " Vectorization yields good result when you are working with large data chunk of homogeneous data" -- I did not know that; thanks for this statement, I will keep that in mind going forward. Question 2: I just double-checked, `starts` and `ends` are in fact sorted and there is no overlap between the two — Alarik, Sep 30 '14 at 06:35
Is there any chance that you could share `myLogicals`, `starts`, `ends` with us? Like uploading those as a mat file and linking it here if it's not too much of a ask and if the data isn't classified per se? — Divakar, Sep 30 '14 at 06:41

Dan · Answer 1 · 2014-09-29T16:27:26.783

3

One approach is to use accumarray. Unfortunately in order to do that we first need to "label" your logical matrix. Here is a convoluted way of doing that if you don't have the image processing toolbox:

sz=size(myLogicals);
s_ind(sz(1),sz(2))=0;
%// OR: s_ind = zeros(size(myLogicals))

s_ind(starts) = 1;
labelled = cumsum(s_ind(:)).*myLogicals(:);

So that just does what Shai's bwlabeln implementation does (but this will be 1-by-numel(myLogicals) in shape as opposed to size(myLogicals) in shape)

Now you can use accumarray:

accumarray(labelled(myLogicals), myArray(myLogicals), [], @max)

or else it may be faster to try

result = accumarray(labelled+1, myArray(:), [], @max);
result = result(2:end)

This is fully vectorized, but is it worth it? You'll have to do speed tests against your loop solution to know.

edited Sep 29 '14 at 16:27

answered Sep 29 '14 at 13:54

Dan

45,079
17
88
157

I tried running the solutions provided on some of my actual data (a 1510 x 2185 matrix). After about 10 runs each, these were the average results: `for` loop, about 0.018 seconds; Dan's solution, about 0.057 seconds; Shai's solution, about 0.150 seconds. So it appears that the `for` loop solution is the most efficeint of these for my specific data and/or computer setup – Alarik Sep 29 '14 at 14:24
1

@Dan I compared the first part of your code with bwlabeln and they both give the same result...just so you know. – Benoit_11 Sep 29 '14 at 14:46
@Benoit_11 thanks! I changed it slightly now though but it should be the same just linearised (flattened) – Dan Sep 29 '14 at 14:48
@Alarik I optimized my code slightly further, not sure if that makes a difference to your time tests? – Dan Sep 29 '14 at 14:49
@Dan nicely done! I think you need to reshape cumsum(s_ind(:)) so that its size is 9x5 though. When I replace your 3rd line with this: labelled = (reshape(cumsum(s_ind(:)),size(myLogicals,1),size(myLogicals,2))).*myLogicals it works. – Benoit_11 Sep 29 '14 at 14:55
1

@Benoit_11 actually I just left out a `(:)` on `myLogicals`, try it now. It should be faster than `reshaping` and works without it... – Dan Sep 29 '14 at 14:56
@Dan Yep you're right! I didn't mean to be a bug haha :) I definitely need to start using cumsum and accumarray more often. +1 – Benoit_11 Sep 29 '14 at 14:58
@Divakar that is often faster, I never remember that! – Dan Sep 29 '14 at 15:11

score 1 · Answer 2 · answered Sep 29 '14 at 13:39

1

Use bwlabeln with a vertical connectivity:

lb = bwlabeln( myLogicals, [0 1 0; 0 1 0; 0 1 0] );

Now you have a label 1..11 for each region.

To get max value you can use regionprops

props = regionprops( lb, myArray, 'MaxIntensity' );
finalAnswers = [props.MaxIntensity];

You can use regionprops to get some other properties of each subset, but it is not too general.
If you wish to apply a more general function to each region, e.g., median, you can use accumarray:

finalAnswer = accumarray( lb( myLogicals ), myArray( myLogicals ), [], @median );

answered Sep 29 '14 at 13:39

Shai

111,146
38
238
371

Very nice solution, though it is a bit slow. – MeMyselfAndI Sep 29 '14 at 14:10
2

I tried running the solutions provided on some of my actual data (a 1510 x 2185 matrix). After about 10 runs each, these were the average results: `for` loop, about 0.018 seconds; Dan's solution, about 0.057 seconds; Shai's solution, about 0.150 seconds. So it appears that the `for` loop solution is the most efficeint of these for my specific data and/or computer setup – Alarik Sep 29 '14 at 14:22
@Alarik That is often the case. I imagine there is a smarter (faster) way of getting `labelled` in my solution, but for-loops often win since Matlab started using it's JIT-compiler. The real question is do you really need to improve on your for-loop solution or are you vectorizing for the sake of it? – Dan Sep 29 '14 at 14:25
@Dan I'm still new to Matlab and seeing as a lot of what I'm working on will be dealing with matrices and matrix subsets (like the question here), I was wondering if there was a more efficient approach than a `for` loop for those cases – Alarik Sep 29 '14 at 14:30
@Alarik since most of the time in my solution is spent computing `lb`, if you have several computations to perform with the same `logicals` mask, the "heavy" labeling can be done once and be re-used for other computations as well. – Shai Sep 29 '14 at 20:32

Divakar · Answer 3 · 2014-09-29T23:33:19.053

Ideas behind vectorization and optimization

One of the approaches that one can employ to vectorize this problem would be to convert the subsets into regular shaped blocks and then finding the max of the elements of the those blocks in one go. Now, converting to regular shaped blocks has one issue here and it is that the subsets are unequal in lengths. To avoid this issue, one can create a 2D matrix of indices starting from each of starts elements and extending until the maximum of the subset lengths. Good thing about this is, it allows vectorization, but at the cost of more memory requirements which would depend on the scattered-ness of the subsets lengths.

Another issue with this vectorization technique would be that it could potentially lead to out-of-limits indices creations for final subsets. To avoid this, one can think of two possible ways -

Use a bigger input array by extending the input array such that maximum of the subset lengths plus the starts indices still lie within the confinements of the extended array.
Use the original input array for starts until we are within the limits of original input array and then for the rest of the subsets use the original loop code. We can call it the mixed programming just for the sake of having a short title. This would save us memory requirements on creating the extended array as discussed in the other approach earlier.

These two ways/approaches are listed next.

Approach #1: Vectorized technique

[m,n] = size(myArray); %// store no. of rows and columns in input array

intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval
max_intv_arr = [0:max_intv]'; %//'# array of max indices extent

[row1,col1] = ind2sub([m n],starts); %// get starts row and column indices

m_ext = max(row1+max_intv); %// no. of rows in extended input array

myArrayExt(m_ext,n)=0; %// extended form of input array
myArrayExt(1:m,:) = myArray;

%// New linear indices for extended form of input array
idx = bsxfun(@plus,max_intv_arr,(col1-1)*m_ext+row1); 

%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArrayExt(idx);                  
selected_ele(bsxfun(@gt,max_intv_arr,intv))= nan;

%// Get the max of the valid ones for the desired output
out = nanmax(selected_ele);   %// desired output

Approach #2: Mixed programming

%// PART - I: Vectorized technique for subsets that when normalized
%// with max extents still lie within limits of input array
intv = ends-starts; %// intervals
max_intv = max(intv); %// max interval

%// Find the last subset that when extended by max interval would still
%// lie within the limits of input array
starts_extent = find(starts+max_intv<=numel(myArray),1,'last');
max_intv_arr = [0:max_intv]'; %//'# Array of max indices extent

%// Index into extended array; select only valid ones by setting rest to nans
selected_ele = myArray(bsxfun(@plus,max_intv_arr,starts(1:starts_extent)));
selected_ele(bsxfun(@gt,max_intv_arr,intv(1:starts_extent))) = nan;

out(numel(starts)) = 0; %// storage for output
out(1:starts_extent) = nanmax(selected_ele); %// output values for part-I

%// PART - II: Process rest of input array elements
for n = starts_extent+1:numel(starts)
    out(n) = max(myArray(starts(n):ends(n)));
end

Benchmarking

In this section we will compare the the two approaches and the original loop code against each other for performance. Let's setup codes before starting the actual benchmarking -

N = 10000; %// No. of subsets
M1 = 1510; %// No. of rows in input array
M2 = 2185; %// No. of cols in input array
myArray = rand(M1,M2);  %// Input array
num_runs = 50; %// no. of runs for each method

%// Form the starts and ends by getting a sorted random integers array from
%// 1 to one minus no. of elements in input array. That minus one is
%// compensated later on into ends because we don't want any subset with
%// starts and ends as the same index
y1 = reshape(sort(randi(numel(myArray)-1,1,2*N)),2,[]);
starts = y1(1,:);
ends = y1(1,:)+1;

%// Remove identical starts elements
invalid = [false any(diff(starts,[],2)==0,1)];
starts = starts(~invalid);
ends = ends(~invalid);

%// Create myLogicals
myLogicals = false(size(myArray));
for k1=1:numel(starts)
    myLogicals(starts(k1):ends(k1))=1;
end

clear invalid y1 k1 M1 M2 N %// clear unnecessary variables

%// Warm up tic/toc.
for k = 1:100
    tic(); elapsed = toc();
end

Now, the placebo codes that gets us the runtimes -

disp('---------------------- With Original loop code')
tic
for iter = 1:num_runs
    %// ...... approach #1 codes
end
toc
%// clear out variables used in the above approach
%// repeat this for approach #1,2

Benchmark Results

In your comments, you mentioned using 1510 x 2185 matrix, so let's do two case runs with such size and subsets of size 10000 and 2000.

Case 1 [Input - 1510 x 2185 matrix, Subsets - 10000]

---------------------- With Original loop code
Elapsed time is 15.625212 seconds.
---------------------- With Approach #1
Elapsed time is 12.102567 seconds.
---------------------- With Approach #2
Elapsed time is 0.983978 seconds.

Case 2 [Input - 1510 x 2185 matrix, Subsets - 2000]

---------------------- With Original loop code
Elapsed time is 3.045402 seconds.
---------------------- With Approach #1
Elapsed time is 11.349107 seconds.
---------------------- With Approach #2
Elapsed time is 0.214744 seconds.

Case 3 [Bigger Input - 3000 x 3000 matrix, Subsets - 20000]

---------------------- With Original loop code
Elapsed time is 12.388061 seconds.
---------------------- With Approach #1
Elapsed time is 12.545292 seconds.
---------------------- With Approach #2
Elapsed time is 0.782096 seconds.

Note that the number of runs num_runs was varied to keep the runtime of the fastest approach close to 1 sec.

Conclusions

So, I guess the mixed programming (approach #2) is the way to go! As future work, one can use standard deviation into the scattered-ness criteria if the performance suffers because of the scattered-ness and offload the work for most scattered subsets (in terms of their lengths) into the loop code.

Wow, thanks so much for this answer! I'm going to need some time to go through it (to make sure I can understand everything that's going on) — Alarik, Sep 29 '14 at 23:58

user3666197 · Answer 4 · 2014-09-30T01:15:38.013

Efficiency

Measure both the vectorised & for-loop code samples on your respective platform ( be it a <localhost> or Cloud-based ) to see the difference:

MATLAB:7> tic();max( myArray( startIndex(:):endIndex(:) ) );toc() %% Details
Elapsed time is 0.0312 seconds.                                   %% below.
                                                                  %% Code is not
                                                                  %% the merit,
                                                                  %% method is:

and

tic();                                                            %% for/loop
for n = 1:length( startIndex )                                    %% may be
    max( myArray( startIndex(n):endIndex(n) ) );                  %% significantly
end                                                               %% faster than
toc();                                                            %% vectorised
Elapsed time is 0.125 seconds.                                    %% setup(s)
                                                                  %% overhead(s)
%% As commented below,
%% subsequent re-runs yield unrealistic results due to caching artifacts
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.
Elapsed time is 0 seconds.

%% which are not so straight visible if encapsulated in an artificial in-vitro
%% via an outer re-run repetitions ( for k=1:1000 ) et al ( ref. in text below )

For a better interpretation of the test results, rather test on much larger sizes than just on a few tens of row/cols.

EDIT: An erroneous code removed, thanks Dan for the notice. Having taken more attention to emphasize the quantitative validation, that may prove the assumption that a vectorised code may, but need not in all circumstances, be faster is not an excuse for a faulty code, sure.

Output - quantitatively comparative data:

While recommended, there is not IMHO fair to assume, the memalloc and similar overheads to be excluded from the in-vivo testing. Test re-runs typically show VM-page hits improvements, other caching artifacts, while the raw 1st "virgin" run is what typically appears in the real code deployment ( excl. external iterators, for sure ). So consider the results with care and retest in your real environment ( sometimes being run as a Virtual Machine inside a bigger system -- that also makes VM-swap mechanics necessary to take into account once huge matrices start hurt on real-life memory-access patterns ).

On other Projects I am used to use [usec] granularity of the realtime test timing, but the more care is necessary to be taken into account about the test-execution conditions and O/S background.

So nothing but testing gives relevant answers to your specific code/deployment situation, however be methodic to compare data comparable in principle.

Alarik's code:

MATLAB:8> tic();   for k=1:1000                  % ( flattens memalloc issues & al )
>                      for n = 1:length( startIndex )
>                          max( myArray( startIndex(n):endIndex() ) );
>                      end;
>                  end; toc()
Elapsed time is 0.2344 seconds.
%%      time is 0.0002 seconds per k-for-loop <--[ ref.^ remarks on testing ]

Dan's code:

MATLAB:9> tic();   for k=1:1000
>                      s_ind( size( myLogicals ) ) = 0;
>                      s_ind( startIndex ) = 1;
>                      labelled = cumsum( s_ind(:) ).*myLogicals(:);
>                      result = accumarray( labelled + 1, myArray(:), [], @max );
>                  end; toc()
error: product: nonconformant arguments (op1 is 43x1, op2 is 45x1)
%%
%% [Work in progress] to find my mistake -- sorry for not being able to reproduce
%% Dan's code and to make it work
%%
%% Both myArray and myLogicals shape was correct ( 9 x 5 )

...but the code is still erroneous? Using the `:` operator on non-scalars in Matlab doesn't do what i think you think it does. As far as I know, it only considers the first element of each, which is why you get a scalar result... — Dan, Sep 29 '14 at 14:52
Yes, Dan, in principle still erroneous. ( Forgive numpy-alike naive matrix slicing ). **The point is about surprises one may face once measuring the real effect of vectorisation** ( some artificially vectorised and indirect access mapping constructions may have much bigger overhead, than just the plain for/loop ). **Some observations show even 10x bigger code-execution times** on such for/loop-substitutes compared to the "slow"-loop-ing. — user3666197, Sep 29 '14 at 15:13
Yes but using an incorrect answer to illustrate this is counterproductive. Also illustrating timing differences on a single run (i.e. not in a large loop or using the `timeit` function) also doesn't make much sense I'm afraid. If you really want to illustrate your point then I suggest you run the timing on some of the working vectorized solutions. Alarik has already indicated that some (if not all) are slower than the loop! That way your answer is giving useful information AND illustrating your (valid) point without using erroneous code. But remember to loop: `tic;for k=1:10000 ... end;toc;` — Dan, Sep 29 '14 at 15:17