
Consider the following simple speed test for arrayfun:

T = 4000;
N = 500;
x = randn(T, N);
Func1 = @(a) (3*a^2 + 2*a - 1);

tic
Soln1 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln1(t, n) = Func1(x(t, n));
    end
end
toc

tic
Soln2 = arrayfun(Func1, x);
toc

On my machine (MATLAB R2011b on Linux Mint 12), the output of this test is:

Elapsed time is 1.020689 seconds.
Elapsed time is 9.248388 seconds.

What the?!? arrayfun, while admittedly a cleaner-looking solution, is an order of magnitude slower. What is going on here?

Further, I did a similar style of test for cellfun and found it to be about 3 times slower than an explicit loop. Again, this result is the opposite of what I expected.

My question is: Why are arrayfun and cellfun so much slower? And given this, are there any good reasons to use them (other than to make the code look good)?

Note: I'm talking about the standard version of arrayfun here, NOT the GPU version from the Parallel Computing Toolbox.

EDIT: Just to be clear, I'm aware that Func1 above can be vectorized as pointed out by Oli. I only chose it because it yields a simple speed test for the purposes of the actual question.
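For reference, Oli's vectorized version from the comments looks as follows (the variable name SolnVec is just for illustration):

tic
SolnVec = 3*x.^2 + 2*x - 1;   % element-wise .^ and scalar * replace the per-element function calls
toc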

EDIT: Following the suggestion of grungetta, I re-did the test with feature accel off. The results are:

Elapsed time is 28.183422 seconds.
Elapsed time is 23.525251 seconds.
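For reference, the accelerator was toggled with the command grungetta suggested:

feature accel off    % disable the JIT accelerator, then rerun the timing test
feature accel on     % re-enable it afterwards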

In other words, it would appear that a big part of the difference is that the JIT accelerator does a much better job of speeding up the explicit for loop than it does arrayfun. This seems odd to me, since arrayfun actually provides more information, i.e., its use reveals that the order of the calls to Func1 does not matter. Also, I noted that whether the JIT accelerator is switched on or off, my system only ever uses one CPU...
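As a side note on the single-CPU observation: you can check how many computational threads MATLAB is willing to use for its multithreaded built-ins with the built-in maxNumCompThreads (this setting only affects built-in vectorized operations, not interpreted for loops):

maxNumCompThreads    % returns the current maximum number of computational threads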

  • Fortunately, the "standard solution" remains the fastest by far: tic; 3*x.^2+2*x-1; toc Elapsed time is 0.030662 seconds. – Oli Sep 21 '12 at 01:32
  • @Oli I suppose I should have anticipated that someone would point this out and used a function that couldn't be vectorized :-) – Colin T Bowers Sep 21 '12 at 02:22
  • I'd be interested to see how this timing changes when the JIT accelerator is turned off. Execute the command 'feature accel off' and then rerun your test. – grungetta Sep 21 '12 at 07:16
  • @grungetta Interesting suggestion. I've added the results to the question along with a few comments. – Colin T Bowers Sep 21 '12 at 07:29
  • Let me add this one to the list of related questions: [What is the fastest way to perform arithmetic operations on each element of a cell array?](http://stackoverflow.com/questions/15851718/what-is-the-fastest-way-to-perform-arithmetic-operations-on-each-element-of-a-ce) – Amro Apr 22 '13 at 10:59
  • It is too bad that the MATLAB docs gracefully omit these issues. I am using arrayfun to access properties of objects inside object arrays, and sadly it just slows things down by orders of magnitude... – Alex Kreimer Apr 05 '15 at 12:41

2 Answers


You can get an idea of what is going on by running other versions of your code. Consider explicitly writing out the computation instead of calling a function inside your loop:

tic
Soln3 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln3(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Time to compute on my computer:

Soln1  1.158446 seconds.
Soln2  10.392475 seconds.
Soln3  0.239023 seconds.
Oli    0.010672 seconds.

Now, while the fully 'vectorized' solution is clearly the fastest, you can see that defining a function to be called for every entry of x adds huge overhead. Just writing the computation out explicitly gave us a factor-of-5 speedup. I guess this shows that MATLAB's JIT compiler does not inline these function calls. According to the answer by gnovice, it is actually better to write a normal (named) function than an anonymous one. Try it.
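To try that suggestion, here is a minimal sketch of the 'named function' variant; the file name func1_file.m and the variable Soln1b are just illustrative, not part of the original test:

% Contents of func1_file.m:
function y = func1_file(a)
    % Same computation as the anonymous Func1, but as an ordinary function file
    y = 3*a^2 + 2*a - 1;
end

% Timing loop using the named function instead of the anonymous handle:
tic
Soln1b = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln1b(t, n) = func1_file(x(t, n));
    end
end
toc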

Next step - remove (vectorize) the inner loop:

tic
Soln4 = ones(T, N);
for t = 1:T
    Soln4(t, :) = 3*x(t, :).^2 + 2*x(t, :) - 1;
end
toc

Soln4  0.053926 seconds.

Another factor-of-5 speedup: there does seem to be something to the advice that you should avoid loops in MATLAB... Or is there really? Have a look at this then:

tic
Soln5 = ones(T, N);
for n = 1:N
    Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
end
toc

Soln5   0.013875 seconds.

Much closer to the 'fully' vectorized version. MATLAB stores matrices in column-major order. You should always (when possible) structure your computations so that they vectorize column-wise.
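A quick way to see the column-major layout is linear indexing; a tiny illustration:

A = [1 2; 3 4];
A(:)'    % returns 1 3 2 4: elements are stored column by column in memory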

We can go back to Soln3 now. The loop order there is 'row-wise'. Let's change it:

tic
Soln6 = ones(T, N);
for n = 1:N
    for t = 1:T
        Soln6(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Soln6  0.201661 seconds.

Better, but still quite bad. Single loop - good. Double loop - bad. I guess MATLAB has done some decent work on improving loop performance, but the loop overhead is still there. If you had heavier work inside the loop, you would not notice it. But since this computation is memory-bandwidth bound, you do see the loop overhead. And you see the overhead of calling Func1 there even more clearly.

So what's up with arrayfun? There is no function inlining there either, so a lot of overhead. But why is it so much worse than a doubly nested loop? Actually, the topic of using cellfun/arrayfun has been extensively discussed many times (e.g. here, here, here and here). These functions are simply slow; you cannot use them for such fine-grained computations. You can use them for code brevity and fancy conversions between cells and arrays. But the function needs to be heavier than what you wrote:

tic
Soln7 = arrayfun(@(a)(3*x(:,a).^2 + 2*x(:,a) - 1), 1:N, 'UniformOutput', false);
toc

Soln7  0.016786 seconds.

Note that Soln7 is a cell array now; sometimes that is useful. Code performance is quite good now, and if you need a cell array as output, you do not need a conversion step after the fully vectorized solution.
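For reference, here is a hedged sketch of the conversion step that this formulation saves you, i.e. splitting the fully vectorized matrix into a cell of columns with num2cell (the names SolnVec and Soln8 are just illustrative):

tic
SolnVec = 3*x.^2 + 2*x - 1;       % fully vectorized T-by-N result
Soln8   = num2cell(SolnVec, 1);   % 1-by-N cell array, one column vector per cell
toc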

So why is arrayfun slower than a simple loop structure? Unfortunately, it is impossible for us to say for sure, since no source code is available. One can only guess that, since arrayfun is a general-purpose function that handles all kinds of data structures and arguments, it is not necessarily very fast in the simple cases that can be expressed directly as loop nests. We cannot know where the overhead comes from. Could the overhead be avoided by a better implementation? Maybe not. But unfortunately, the only thing we can do is study the performance to identify the cases in which it works well, and those in which it doesn't.

Update: Since the execution time of this test is short, to get reliable results I have now wrapped each test in a loop:

for i=1:1000
   % compute
end

Some of the resulting timings are given below:

Soln5   8.192912 seconds.
Soln7  13.419675 seconds.
Oli     8.089113 seconds.

You can see that arrayfun is still slow, but at least not three orders of magnitude slower than the vectorized solution. On the other hand, a single loop with column-wise computations is as fast as the fully vectorized version... That was all done on a single CPU. The results for Soln5 and Soln7 do not change if I switch to 2 cores: in Soln5 I would have to use parfor to get it parallelized. Forget about speedup... Soln7 does not run in parallel because arrayfun does not run in parallel. Oli's vectorized version, on the other hand:

Oli  5.508085 seconds.
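For completeness, a minimal sketch of the parfor variant of Soln5 mentioned above; it assumes the Parallel Computing Toolbox with an open worker pool, and the name Soln5p is just illustrative:

tic
Soln5p = ones(T, N);
parfor n = 1:N
    Soln5p(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;   % Soln5p is a sliced output variable
end
toc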
  • Great answer! And the links to MATLAB Central all provide very interesting reads. Many thanks. – Colin T Bowers Sep 21 '12 at 08:57
  • And an interesting update! This answer just keeps on giving :-) – Colin T Bowers Sep 22 '12 at 02:05
  • Just a small comment; back in MATLAB 6.5, `cellfun` was implemented as a MEX-file (with C source code available beside it). It was actually quite straightforward. Of course it only supported applying one of 6 hard-coded functions (you couldn't pass a function handle, only a string with one of the function names). – Amro Apr 22 '13 at 10:57
  • arrayfun + function handle = slow! Avoid them in heavy code. – Yvon Aug 14 '14 at 04:30
  • @Amro If arrayfun and cellfun are implemented as you said, then it is quite probable that MATLAB does not use any SIMD or MIMD instructions. But it could be that the JIT compiler knows how to, as every good compiler does. That could explain the time-difference factor against "Soln5" and "Oli". – Tik0 Aug 26 '14 at 13:38
  • @Tik0: Like I said, that was the case back in MATLAB 6.x (before JIT compilation was introduced). I think starting with MATLAB 7, `cellfun` and `arrayfun` became built-in functions (with function-handle support), but I have no idea how they are implemented these days. – Amro Aug 26 '14 at 16:30
  • "You can use them for code brevity and fancy conversions between cells and arrays" I use arrayfun so my functions will easily accept arguments of indeterminate dimension. Easier than building some dimension-handling code myself. – Mr Purple Oct 31 '14 at 00:06

That's because

x = randn(T, N);

is not of gpuArray type.

All you need to do is

x = randn(T, N, 'gpuArray');
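For completeness (and keeping in mind that the question explicitly excluded the GPU path), a minimal sketch of what the GPU route hinted at here would look like; it assumes the Parallel Computing Toolbox and a supported NVIDIA GPU, and the variable names are just illustrative:

xg      = gpuArray(randn(T, N));   % generate the data on the CPU and move it to the GPU
SolnGPU = arrayfun(Func1, xg);     % the gpuArray overload of arrayfun applies Func1 element-wise on the device
SolnCPU = gather(SolnGPU);         % copy the result back to host memory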
  • I think you need to read the question and the excellent answer by @angainor a little more carefully. It doesn't have anything to do with `gpuArray`. That is almost certainly why this answer has been downvoted. – Colin T Bowers Aug 14 '14 at 00:38
  • @Colin - I agree angainor's is more thorough, but that answer does not mention 'gpuArray'. I think the 'gpuArray' is a good contribution here (if it's correct). Also, the question got a little sloppy with *"What is going on here?"*, so I think it opened the door for additional methods like vectorizing data and shipping it off to a GPU. I'm letting this answer ride because it could add value for future visitors. My apologies if I made the wrong call. – jww Aug 25 '14 at 00:09
  • You also forget the fact that `gpuArray` is only supported for NVIDIA graphics cards. Should someone not have such hardware, then your advice (or lack thereof) is meaningless. -1 – rayryeng Jan 23 '15 at 14:30
  • On the other hand, gpuArray is the light saber of MATLAB vectorized programming. – MrIO Mar 01 '16 at 21:05