
Say I want to multiply each element of a cell array A by a coefficient k. I can do that by:

A = cellfun(@(x) k*x, A, 'UniformOutput', false)

But this is extremely slow. Is there a faster and better way? The cell array elements are variable-length vectors, so cell2mat doesn't apply.

Edit: Based on fpe's recommendation of a for loop, here is an example benchmark. Starting with this data

A = arrayfun(@(n) rand(n,1), randi(5,1000,1000), 'UniformOutput',false);

The cellfun call above takes 9.45 seconds, while a for loop:

A2 = cell(size(A));
for i = 1:size(A,1), for j = 1:size(A,2), A2{i,j} = A{i,j}*k; end; end
A = A2;

takes 1.67 seconds, which is a significant improvement. I'd still prefer something a few orders of magnitude faster. (I also don't understand why the Matlab interpreter is unable to make the cellfun call as fast as the for loop. They are semantically identical.)

Edit 2: Amro's suggestion to use a single for loop is significantly faster:

for i = 1:numel(A), A{i} = A{i}*k; end

takes 1.11 seconds, and if I run pack beforehand to consolidate memory, just 0.88 seconds.

Implementing a MEX function to do this is actually not much better: 0.73 seconds (0.53 seconds after pack), which indicates that allocating many small matrices is slow in Matlab.

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    if (nrhs != 2)
        mexErrMsgTxt("need 2 arguments (Cell, Coefficient)");

    mwSize const *size = mxGetDimensions(prhs[0]);
    mwSize N = mxGetNumberOfDimensions(prhs[0]);

    if (mxGetNumberOfElements(prhs[1]) != 1)
        mexErrMsgTxt("second argument to multcell must be a scalar");

    double coefficient = *mxGetPr(prhs[1]);

    plhs[0] = mxCreateCellArray(N, size);

    mwSize M = mxGetNumberOfElements(prhs[0]);

    for (mwSize i = 0; i < M; i++) {
        mxArray *r = mxGetCell(prhs[0], i);
        /* allocate an output matrix with the same shape as the input cell */
        mxArray *l = mxCreateNumericArray(mxGetNumberOfDimensions(r),
                                          mxGetDimensions(r),
                                          mxDOUBLE_CLASS,
                                          mxREAL);
        double *rp = mxGetPr(r);
        double *lp = mxGetPr(l);
        mwSize num_elements = mxGetNumberOfElements(r);
        for (mwSize j = 0; j < num_elements; j++)
            lp[j] = rp[j] * coefficient;
        mxSetCell(plhs[0], i, l);
    }
}

Cheating a bit, however, and implementing a MEX function that actually edits the memory in place seems to be the only way to get reasonable performance out of the operation: 0.030 seconds. This uses the undocumented mxUnshareArray, as suggested by Amro.

#include "mex.h"

// undocumented: un-share a possibly copy-on-write array so it is safe to edit in place
extern "C" bool mxUnshareArray(mxArray *array_ptr, bool noDeepCopy);

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    if (nrhs != 2)
        mexErrMsgTxt("need 2 arguments (Cell, Coefficient)");

    if (mxGetNumberOfElements(prhs[1]) != 1)
        mexErrMsgTxt("second argument to multcell must be a scalar");

    double coefficient = *mxGetPr(prhs[1]);

    mxUnshareArray(const_cast<mxArray *>(prhs[0]), false);
    plhs[0] = const_cast<mxArray *>(prhs[0]);

    mwSize M = mxGetNumberOfElements(prhs[0]);

    for (mwSize i = 0; i < M; i++) {
        mxArray *r = mxGetCell(prhs[0], i);
        double *rp = mxGetPr(r);
        mwSize num_elements = mxGetNumberOfElements(r);
        // scale each vector in place; no allocation needed
        for (mwSize j = 0; j < num_elements; j++)
            rp[j] *= coefficient;
    }
}
  • in the newest MATLAB releases, `for` loops are generally the easiest and fastest solution. – fpe Apr 06 '13 at 14:10
  • thank you, a for loop is indeed significantly faster (still very slow though) – digitalvision Apr 06 '13 at 14:25
  • digitalvision: why do you say it is extremely slow? what sizes are we talking about here, and how long is it currently taking (tic/toc)? I doubt such an operation is the bottleneck in your code... Run the profiler and try to optimize other actual hot-spots.. – Amro Apr 06 '13 at 14:27
  • Amro: I have added some benchmark results to the question. This operation is the single dominating bottleneck. I never imagined that the for-loop could be faster though. – digitalvision Apr 06 '13 at 14:41
  • Matlab does a lot of behind-the-scenes parallelization when it can. While I find it odd that `cellfun` isn't similarly optimized, I would not be surprised if this accounted for the difference. On that note: you are using cell arrays because your vectors are of different length; have you considered using a matrix and padding shorter vectors with `nan`? I know it's ugly, but if you are that concerned about speed it might be worth trying. – Alan Apr 06 '13 at 14:46
  • @digitalvision: I see. I wouldn't be too surprised if such straightforward for-loops are more heavily optimized by the JIT accelerator. By the way it could be simplified further as: `for i=1:numel(A), A{i} = A{i}*k; end`. If you are still not satisfied, perhaps a C MEX-file could cut a little bit more time... Or even better, use `parfor` if you have access to the PCT toolbox. I should also mention that sometimes running an M-file function vs. a script, or running from the command prompt, could make a difference – Amro Apr 06 '13 at 14:49
  • Another idea is that if you need to perform several operations on those vectors, you could linearize the whole thing into one big matrix: `cell2mat(A(:))`, process it, then put it back together using something like `mat2cell` (having recorded the sizes of each vector beforehand). Of course the conversion overhead would only be justified if you perform several operations on the small vectors. – Amro Apr 06 '13 at 15:00
  • Alan: That is an interesting idea. The problem is that in my case, the vector lengths are almost geometrically distributed without an upper bound. The overhead would thus probably be prohibitive. One other idea would be to concatenate all vectors into a one dimensional array, and in addition keep one matrix of start indices, and another with lengths. – digitalvision Apr 06 '13 at 15:27
  • Amro: Thank you! I edited the question incorporating your suggestions. Seems you have to cheat a bit to get good performance. – digitalvision Apr 06 '13 at 16:17
  • @digitalvision: editing input arrays in-place is dangerous in MEX functions because of the copy-on-write mechanism, and can possibly crash with a segfault in some cases. There is an undocumented solution by using `mxUnshareArray`: http://undocumentedmatlab.com/blog/matlab-mex-in-place-editing/ – Amro Apr 06 '13 at 16:40
  • Amro: Thank you! Didn't know that a safer (if undocumented) way to do in-place editing existed. I have once again updated the question. – digitalvision Apr 06 '13 at 17:03
  • @digitalvision: one final idea: you could perhaps squeeze out a little bit more performance by writing a multi-threaded MEX file. See here for an [example using OpenMP](http://www.walkingrandomly.com/?p=1795). Just keep in mind that the MEX API is not thread-safe, so only the innermost loop (that does multiplication) can be parallelized, but since your vectors are really small (length of 1 to 5), I doubt that will be beneficial in this case.. – Amro Apr 06 '13 at 18:29

1 Answer


Not exactly an answer, but here is a way to see the effect of the JIT compiler and accelerator in both approaches (cellfun vs. for-loop):

feature('jit', 'off'); feature('accel', 'off');
tic, A = cellfun(@(x) k*x, A, 'UniformOutput', false); toc
tic, for i=1:numel(A), A{i} = A{i}*k; end, toc

feature('jit', 'on'); feature('accel', 'on');
tic, A = cellfun(@(x) k*x, A, 'UniformOutput', false); toc
tic, for i=1:numel(A), A{i} = A{i}*k; end, toc

I get the following

Elapsed time is 25.913995 seconds.
Elapsed time is 13.050288 seconds.

vs.

Elapsed time is 10.053347 seconds.
Elapsed time is 1.978974 seconds.

with optimization turned on in the second.

By the way, a parallel parfor loop performed much worse (at least on my local test machine, with a pool size of 2 processes).

Seeing the results you posted, a MEX-function is the way to go :)

Amro
  • what I'm gonna ask is not related to this question, but maybe you have a clue. Do you think that turning the optimization on as you have done would improve in any way a Cholesky factorization (built-in function `chol`)? – fpe Apr 06 '13 at 16:56
  • @fpe: I think that JIT compilation only affects interpreted MATLAB code (by generating optimized code on the fly), and afaik that would not affect builtin functions. Anyway both optimizations are turned on by default, so you don't have to worry about that... Plus when data meets [certain conditions](http://www.mathworks.com/support/solutions/en/data/1-4PG4AN/), `CHOL` is automatically multi-threaded as well. – Amro Apr 06 '13 at 17:04
  • @fpe: in the same spirit, you could start MATLAB with the `-singleCompThread` option to compare and see the effect of multi-threading. I forgot to mention that most builtin linear algebra functions (like CHOL) use highly optimized implementations of BLAS and LAPACK (Intel MKL library, or the equivalent for AMD processors). So you can rest assured you are getting the best performance.. – Amro Apr 06 '13 at 18:45