I currently use the Matlab code

m = m.*(abs(m)>=THRESH)

to set to zero those elements of the matrix m that lie within THRESH either side of zero. This piece of code is called hundreds of thousands of times, so it is performance critical. Typically size(m) is [1000, 400].

I decided to see whether performance could be improved by using a mex function. So, with speed-optimisation settings in an x64 release compilation, I used the following core C++ code to threshold the matrix:
void thresholdArrayMex(mxArray *inArr, const double THRESH)
{
    double *indata = (double *)mxGetData(inArr);
    const mwSize mrows = mxGetM(inArr);
    const mwSize ncols = mxGetN(inArr);
    const mwSize size = mrows * ncols;
    for (mwSize idx = size; idx--; ) {
        if (fabs(indata[idx]) < THRESH) {
            indata[idx] = 0.0;
        }
    }
}
In the parent function I have used mxCreateSharedDataCopy, as explained here, so that I do not need to make a deep copy of the underlying data associated with inArr (which is passed in from Matlab). Unfortunately the entire mex implementation is on average three times slower than the Matlab one-liner. The culprit is the line indata[idx] = 0.0;. If I comment out only this line (keeping the loop and the logical comparison in place), the mex file runs ten times faster than the Matlab code, which rules out mex call overhead, library linking, etc. as the cause of the performance degradation.

Does anyone know why there is such a performance hit in assigning elements of a double array to zero? Could it be because the memory is not contiguous? Is it related to the way I have accessed the underlying data in-place using mxCreateSharedDataCopy, with Matlab doing something expensive behind the scenes? I have tried deep copying the array before thresholding; however, i) a call to mxDuplicateArray is far too expensive, and ii) the assignment operation remains expensive.
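For reference, a minimal sketch of the gateway (the argument order matches the timing script below; mxCreateSharedDataCopy is undocumented, so it has to be declared by hand):

#include "mex.h"
#include <cmath> // for fabs in thresholdArrayMex

// Undocumented Matlab API: creates a new mxArray header whose data
// pointer is shared with the source array (no deep copy of the data).
extern "C" mxArray *mxCreateSharedDataCopy(const mxArray *pr);

void thresholdArrayMex(mxArray *inArr, const double THRESH); // as above

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const double THRESH = mxGetScalar(prhs[0]);
    // Shared-data copy of the input: writing through it modifies the
    // caller's array in place, so no deep copy is ever made.
    plhs[0] = mxCreateSharedDataCopy(prhs[1]);
    thresholdArrayMex(plhs[0], THRESH);
}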
EDIT 1: In response to comments: I am not doing matrix multiplication or any other operation for which Matlab is highly optimised. I do the profiling in Matlab with
t1 = 0;
t2 = 0;
t3 = 0;
N = 10000;
THRESH = 0.4;
ThresholdElementsMex(1, rand(1000, 400)); % call once to mitigate any dynamic loading effects
for i = 1:N
    mat1 = 2*(rand(1000, 400)-0.5);
    mat2 = mat1;
    mat3 = mat1;

    atic = tic();
    mat1 = ThresholdElementsMex(THRESH, mat1);
    ta = toc(atic);
    t1 = t1+ta;

    btic = tic();
    ThresholdElementsMex(THRESH, mat2);
    tb = toc(btic);
    t2 = t2+tb;

    ctic = tic();
    mat3 = mat3.*(abs(mat3)>=THRESH);
    tc = toc(ctic);
    t3 = t3+tc;
end
t1 = t1/N
t2 = t2/N
t3 = t3/N
As it stands, typical average timings using mxCreateSharedDataCopy are

t1 = 0.0018
t2 = 0.0020
t3 = 0.00094  % Matlab is twice as fast
but simply commenting out indata[idx] = 0.0; gives

t1 = 0.000013  % removing the C++ assignment line reveals its cost
t2 = 0.00038
t3 = 0.00094
All these results have insignificant variance across repeated runs of the timing script.

The C++ compiler optimisations I used are Maximize Speed (/O2), Enable Intrinsic Functions (/Oi), and Favor Fast Code (/Ot).
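For completeness, a roughly equivalent way to apply those flags when building with the mex command on Windows/MSVC (I actually set them in the Visual Studio project, so the COMPFLAGS line below is an assumed equivalent):

mex COMPFLAGS="$COMPFLAGS /O2 /Oi /Ot" ThresholdElementsMex.cpp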