I currently use the Matlab code

m = m.*(abs(m)>=THRESH)

to set to zero those elements of the matrix m that lie within THRESH either side of zero. This piece of code is called hundreds of thousands of times, so it is performance critical. Typically size(m) is [1000, 400].

I decided to see whether performance could be improved by using a mex function. So, with speed-optimisation settings in an x64 release compilation, I used the following core C++ code to threshold the matrix:
void thresholdArrayMex(mxArray *inArr, const double THRESH)
{
    double *indata = (double *)mxGetData(inArr);
    const mwSize mrows = mxGetM(inArr);
    const mwSize ncols = mxGetN(inArr);
    const mwSize size = mrows * ncols;
    for (mwSize idx = size; idx--; ) {
        if (fabs(indata[idx]) < THRESH) {
            indata[idx] = 0.0;
        }
    }
}
In the parent function I have used mxCreateSharedDataCopy, as explained here, so that I do not need to make a deep copy of the underlying data associated with inArr (which is passed in from Matlab). Unfortunately the entire mex implementation is on average three times slower than the Matlab one-liner. The culprit is the line indata[idx] = 0.0;. If I comment out only this line (keeping the loop and the logical comparison in place), the mex file runs ten times faster than the Matlab code, which rules out mex call overhead, library linking, etc. as the cause of the performance degradation.

Does anyone know why there is such a performance hit in assigning elements of a double array to zero? Could it be because the memory is not contiguous? Is it related to the way I have accessed the underlying data in-place using mxCreateSharedDataCopy, with Matlab doing something expensive behind the scenes? I have tried deep copying the array before thresholding; however, i) a call to mxDuplicateArray is far too expensive, and ii) the assignment operation remains expensive.
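For reference, a minimal sketch of the gateway (the argument order matches the timing script below; mxCreateSharedDataCopy is undocumented, so it has to be declared by hand):

#include "mex.h"
#include <cmath> // for fabs in thresholdArrayMex

// Undocumented Matlab API: creates a new mxArray header whose data
// pointer is shared with the source array (no deep copy of the data).
extern "C" mxArray *mxCreateSharedDataCopy(const mxArray *pr);

void thresholdArrayMex(mxArray *inArr, const double THRESH); // as above

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const double THRESH = mxGetScalar(prhs[0]);
    // Shared-data copy of the input: writing through it modifies the
    // caller's array in place, so no deep copy is ever made.
    plhs[0] = mxCreateSharedDataCopy(prhs[1]);
    thresholdArrayMex(plhs[0], THRESH);
}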
EDIT 1: In response to comments: I am not doing matrix multiplication or any other operation for which Matlab is highly optimised. I do the profiling in Matlab with
t1 = 0;
t2 = 0;
t3 = 0;
N = 10000;
THRESH = 0.4;
ThresholdElementsMex(1, rand(1000, 400)); % call once to mitigate any dynamic loading effects
for i = 1:N
    mat1 = 2*(rand(1000, 400)-0.5);
    mat2 = mat1;
    mat3 = mat1;

    atic = tic();
    mat1 = ThresholdElementsMex(THRESH, mat1);
    ta = toc(atic);
    t1 = t1+ta;

    btic = tic();
    ThresholdElementsMex(THRESH, mat2);
    tb = toc(btic);
    t2 = t2+tb;

    ctic = tic();
    mat3 = mat3.*(abs(mat3)>=THRESH);
    tc = toc(ctic);
    t3 = t3+tc;
end
t1 = t1/N
t2 = t2/N
t3 = t3/N
As it stands, typical average timings using mxCreateSharedDataCopy are

t1 = 0.0018
t2 = 0.0020
t3 = 0.00094  % Matlab is twice as fast
but simply commenting out indata[idx] = 0.0; gives

t1 = 0.000013  % removing the C++ assignment line reveals its cost
t2 = 0.00038
t3 = 0.00094
All these results have insignificant variance across repeated runs of the timing script.

The C++ compiler optimisations I used are Maximize Speed (/O2), Enable Intrinsic Functions (/Oi), and Favor Fast Code (/Ot).
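For completeness, a roughly equivalent way to apply those flags when building with the mex command on Windows/MSVC (I actually set them in the Visual Studio project, so the COMPFLAGS line below is an assumed equivalent):

mex COMPFLAGS="$COMPFLAGS /O2 /Oi /Ot" ThresholdElementsMex.cpp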