8

I have some Matlab code which needs to be speeded up. Through profiling, I've identified a particular function as the culprit in slowing down the execution. This function is called hundreds of thousands of times within a loop.

My first thought was to convert the function to mex (using Matlab Coder) to speed it up. However, common programming sense tells me the interface between Matlab and the mex code would lead to some overhead, which means calling this mex function thousands of times might not be a good idea. Is this correct? Or does Matlab do some magic when it's the same mex being called repeatedly to remove the overhead?

If there is significant overhead, I'm thinking of restructuring the code so as to add the loop to the function itself and then creating a mex of that. Before doing that, I would like to validate my assumption to justify the time spent on this.

Update:

I tried @angainor's suggestion, and created donothing.m with the following code:

function nothing = donothing(dummy) %#codegen
nothing = dummy;
end

Then, I created a mex function from this as donothing_mex, and tried the following code:

tic;
for i=1:1000000
    donothing_mex(5);
end
toc;

The result was that a million calls to the function took about 9 seconds. This is not a significant overhead for our purposes, so for now I think I will convert the called function alone to mex. However, calling a function from a loop that executes about a million times does seem a pretty stupid idea in retrospect, considering this is performance critical code, so moving the loop to the mex function is still in the books, but with much lesser priority.

Sundar R
  • 13,776
  • 6
  • 49
  • 76
  • 1
    Definitely move the loop to the code. It should require only 2 or 3 extra lines of mex code and will save you probably 8.5 seconds out of that 9. – twerdster Oct 17 '12 at 07:37
  • 2
    Judging from your comments on @angainor's answer, the approach you're taking has an aftertaste of an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem), e.g., the MEX you want to create to solve a performance issue *might* just have a much faster solution in Matlab, just one you haven't thought of before. Can you perhaps post the essence of the calculations you want to perform in the loop you have now? – Rody Oldenhuis Oct 17 '12 at 08:03
  • @RodyOldenhuis That is also a good point. [Premature optimization is the root of all evil](http://shreevatsa.wordpress.com/2008/05/16/premature-optimization-is-the-root-of-all-evil/) ;) – angainor Oct 17 '12 at 08:54
  • @RodyOldenhuis It's basically an implementation of the algorithm given here: http://ginstrom.com/scribbles/2007/12/01/fuzzy-substring-matching-with-levenshtein-distance-in-python/ The only change from that code is that instead of having two variables row1 and row2 I have a single matrix 'dist' containing the Levenshtein distances matrix. – Sundar R Oct 17 '12 at 20:21

3 Answers3

5

As usual, it all depends on the amount of work you do in the MEX file.. The overhead of calling MEX function is constant and does not depend on e.g., the problem size. It means that arguments are not copied to new, temporary arrays. Hence, if it is enough work, the MATLAB overhead of calling the MEX file will not show. Anyway, in my experience the MEX call overhead is significant only for the first time the mex function is called - the dynamic library has to be loaded, symbols resolved etc. Subsequent MEX calls have very little overhead and are very efficient.

Almost everything in MATLAB is connected with some overhead due to the nature of this high-level language. Unless you have a code, which you are sure is fully compiled with JIT (but then you do not need a mex file :)) So you have a choice of one overhead over the other..

So sum up - I would not be too scared of MEX calling overhead.

Edit As often heard here and elsewhere, the only reasonable thing to do in any particular case is of course BENCHMARK and check it for your self. You can easily estimate the MEX call overhead by writing a trivial MEX function:

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[ ], int nrhs, const mxArray *prhs[ ]) 
{      
}

On my computer you get

tic; for i=1:1000000; mexFun; end; toc
Elapsed time is 2.104849 seconds.

That is 2e-6s overhead per MEX call. Add your code, time it and see, if the overhead is at acceptable level, or not.

As Andrew Janke noted below (thanks!), the MEX function overhead apparently depends on the number of arguments you pass to the MEX function. It is a small dependence, but it is there:

a = ones(1000,1);
tic; for i=1:1000000; mexFun(a); end; toc
Elapsed time is 2.41 seconds.

It is not related to size of a:

a = ones(1000000,1);
tic; for i=1:1000000; mexFun(a); end; toc
Elapsed time is 2.41805 seconds.

But it is related to the number of arguments

a = ones(1000000,1);
b = ones(1000000,1);
tic; for i=1:1000000; mexFun(a, b); end; toc
Elapsed time is 2.690237 seconds.

So you might want to take that into account in your tests.

angainor
  • 11,760
  • 2
  • 36
  • 56
  • The code inside is pretty small actually - basically, it's just a Levenshtein distance calculator function. – Sundar R Oct 16 '12 at 19:16
  • @sundar I can not tell you exactly what you can expect, since I do not know the code and problems you deal with. If you don't want to invest too much time, why don't you write a trivial mex file, which does nothing more than add two numbers, or returns `x+1`, and measure the overhead? As I said, it is constant, but obviously depends on your hardware/os/matlab version. You will be able to add it to the performance of your C code and see what to expect in MATLAB+MEX. – angainor Oct 16 '12 at 19:21
  • I have updated the question with the overhead measurement, and have decided to postpone moving the loop for now. I'll wait till tomorrow just in case other answers come up, and will accept your answer after that. – Sundar R Oct 16 '12 at 19:40
  • @sundar I just did the test myself :) 2 seconds for 1e6 calls.. Well, it is definitely not nothing. And it is more than the overhead of calling a normal matlab function(!) if you use JIT. So you do need to get some speedup because of your mex file... – angainor Oct 16 '12 at 19:45
  • FYI, have a look at http://stackoverflow.com/questions/1693429/is-matlab-oop-slow-or-am-i-doing-something-wrong; I ended up doing a similar test of function call overhead for it. – Andrew Janke Oct 16 '12 at 23:08
  • 1
    Also note that the overhead is going to be constant with respect to array size, but may increase with the number of function arguments; the raw data is shared with copy-on-write, but a new "header" part of the mxarray data structure is built and passed to the function, at least in the versions I've looked at. (This may be defensive on Matlab's part and may explain some of the higher MEX file overhead.) – Andrew Janke Oct 16 '12 at 23:11
  • @AndrewJanke Very good point, I have updated my answer. Thanks. Also, a good work you did on various method overheads in matlab. – angainor Oct 17 '12 at 08:17
  • I did some calculations and realized this code runs for about 3e11 iterations, so the overhead would theoretically add up to days! This is a three level nested loop where the outermost one is a parfor, so I'm moving the inner two loops into the code to be mex-ified. I also benchmarked different versions of my function itself: the basic Matlab version, the mex-ified version, the version with loop inside the mex code, and a version where the code directly appears inside the loop instead of a function call. The loop-inside version won hands-down, so I'm going ahead with implementing it. Thanks! – Sundar R Oct 18 '12 at 18:26
  • @sundar Great, although it would be much better for performance to implement this in C yourself :) but since you can not do it, well.. Maybe you can post your detailed benchmarks - they would be useful. – angainor Oct 18 '12 at 19:05
2

You should absolutely without any hesitation move the loop inside the mex file. The example below demonstrates a 1000 times speedup for a virtually empty work unit in a for loop. Obviously as the amount of work in the for loop changes this speedup will decrease.

Here is an example of the difference:

Mex function without internal loop:

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[ ], int nrhs, const mxArray *prhs[ ]) 
{      
    int i=1;    
    plhs[0] = mxCreateDoubleScalar(i);
}

Called in Matlab:

tic;for i=1:1000000;donothing();end;toc
Elapsed time is 3.683634 seconds.

Mex function with internal loop:

#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[ ], int nrhs, const mxArray *prhs[ ]) 
{      
    int M = mxGetScalar(prhs[0]);
    plhs[0] = mxCreateNumericMatrix(M, 1, mxDOUBLE_CLASS, mxREAL);
    double* mymat = mxGetPr(plhs[0]);
    for (int i=0; i< M; i++)
        mymat[i] = M-i;
}

Called in Matlab:

tic; a = donothing(1000000); toc
Elapsed time is 0.003350 seconds.
twerdster
  • 4,977
  • 3
  • 40
  • 70
  • 1
    While you are right of course that in this case you get a lot faster code, the question was different. It was about the overhead of calling a MEX function. Your speedup can completely disappear if in the MEX file you do more work than just creating a scalar. It is a clear trade-off and your answer only applies to your special scenario. If one call to the mex file takes 0.01s, you don't care where the loop is at all. Also, it may not be at all trivial to move **any** loop into the MEX file. – angainor Oct 17 '12 at 07:59
  • Of course your speedup can disappear if you do more work in the mex file but thats irrelevant. The question referred to mex file overhead. I demonstrated that if you call a mex file 1e6 times in a loop in matlab as opposed to calling the same functionality in inside a loop in the mex file itself you will suffer mex file call overhead. Im not sure why you would think otherwise. – twerdster Oct 17 '12 at 08:15
  • I am not arguing with that. I have given absolute overhead estimates as well. OP clearly said that before he decides to move the loop into the MEX file he wants to be sure if it is worth the effort. According to you, is it always worth it? Because I think it depends on 1) the amount of work he does in 1 iteration of the loop and 2) the amount of programing work necessary to move the loop structure to MEX function. It is not true that it takes in general only 2 more lines to do that. It takes 2 lines in your example. – angainor Oct 17 '12 at 08:21
  • 1
    Just for completeness: the difference you are seeing is mostly due to Matlab's JIT compiler being unable to compile the loop due to the non-builtin mex function. If you execute `tic;for i=1:1000000;i=1;end;toc` (i.e., the same functionality, but then JIT'able), it takes 0.005825 seconds on my machine. This is exactly why loops should always be transferred to the mex. – Rody Oldenhuis Oct 18 '12 at 09:09
2

Well, this is the fastest I can make it in Matlab:

%#eml
function L = test(s,t)

    m = numel(s);
    n = numel(t);

    % trivial cases
    if m==0 && n==0
        L = 0; return; end
    if n==0
        L = m; return; end
    if m==0
        L = n; return; end

    % non-trivial cases
    M = zeros(m+1,n+1);    
    M(:,1) = 0:m;

    for j = 2:n+1
        for i = 2:m+1
            M(i,j) = min([
                M(i-1,j) + 1
                M(i,j-1) + 1
                M(i-1,j-1) + (s(i-1)~=t(j-1));
                ]);
        end
    end

    L = min(M(end,:));

end

Can you compile this and run some tests? (For some weird reason, compilation fails to work on my installation...) Perhaps change %#eml to %#codegen first, if you think that's easier.

NOTE: for the C version, you should also interchange the for-loops, so that the loop over j is the inner one.

Also, the row1 and row2 approach is a lot more memory efficient. If you're going to compile anyway, I'd use that approach.

Rody Oldenhuis
  • 37,726
  • 7
  • 50
  • 96
  • That's almost an exact copy of my code (except we don't need the trivial cases). What does %#eml do in contrast to %#codegen? Good suggestion about the for loop interchanging, thanks (I'm guessing since C is row-major, interchanging would make it more cache-friendly, is that right?) As for row1,row2 approach, would memory efficient also mean faster? – Sundar R Oct 18 '12 at 11:18
  • 1
    @sundar: `%#eml` tells Matlab to use embedded Matlab, a subset of `m` that Matlab can compile directly to machine code. `%#codegen` is used to generate C-code (and I never used it, since it's not in R2010b :). Yes, C is row major. It's a often a major hurdle of translating between C-family and Fortran-family languages. The `row1,row2` approach: it will likely only be a tiny bit faster, it will just use less memory, which might be relevant when comparing larger strings. Also, a word of advice: ***ALWAYS*** implement any trivial cases!! They **will** happen sooner or later, intended or not. – Rody Oldenhuis Oct 18 '12 at 11:26
  • @sundar: But just for my understanding, you're calling this function a zillion times in a loop? Is that what you're trying to do? If that's the case, try copy-pasting this whole code directly into the loop body (with adjustments for correct interfacing of course). That way, Matlab's JIT will compile the whole loop and you might gain a lot of speed (JIT can't handle calls to non-builtin functions inside a loop, so you'll run on "interpretation-speed") – Rody Oldenhuis Oct 18 '12 at 11:31
  • @sundar: Alternatively, allow `test` to accept cell strings, and iterate over their contents inside `test` (a much cleaner approach, acutually :) – Rody Oldenhuis Oct 18 '12 at 11:32
  • Does your last comment mean allowing `test` to accept cell array of strings will help JIT handle it? And yes, that is what I'm doing, calling this function, on an average, about 3e11 times. – Sundar R Oct 18 '12 at 12:17
  • @sundar: 300 billion times *on average*...I thought you were exaggerating :) Well, yes, if `test` accepts cell strings, you add another loop (and some checks etc.) that loops through the elements in the cell-array(s). In theory, this triple-loop should be small and simple enough to be JIT'ed. It'll still be slower than a MEX will be though, so I'd suggest to do it in C anyway and regard the cell-string solution as an intermediate step. As an aside, to answer your original question: see my comment on @twerdster's answer. – Rody Oldenhuis Oct 18 '12 at 12:43
  • +1 Good job with getting to the bottom of this question :) I'm now having second thoughts about MATLABs JIT and its efficiency after seeing sundars second question today. The generated C-code is rather complex and inefficient.. Is this what JIT is compiling? – angainor Oct 18 '12 at 19:13