Is this code correct to allocate two aligned cv::Mat?

Question

Disclaimer: I am a SIMD newbie, but I have a good knowledge of OpenCV.

Let's suppose I have basic OpenCV code.

cv::Mat1f a(n, n);
cv::Mat1f b(n, n);
cv::Mat1f x;
// fill a and b somehow
x = a + b;

Now, let's suppose I want to run the code above on an AVX2 or even AVX-512 machine. Having the data 32-byte aligned could be a great benefit for performance. As far as I can tell, the data above is not aligned: the optimization report files generated by the Intel compiler say that the data is unaligned, so I'm almost certain of it, but I could be wrong.
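Before relying on the compiler report alone, the alignment of a buffer can be checked directly at run time. The snippet below is a minimal sketch in plain C++; `is_aligned` is a hypothetical helper name, not an OpenCV function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when `p` sits on an `alignment`-byte boundary.
// `alignment` must be a power of two (32 for AVX2, 64 for AVX-512).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

With a cv::Mat, one would presumably pass `a.data` for the first row and `a.ptr<float>(i)` for every other row, since the start of each row depends on the row stride as well as on the base pointer.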

So what if I allocate 32-aligned pointers and use them for the operations above? Something like:

float *apt = (float*)_mm_malloc(sizeof(float)*n*m, 32);
float *bpt = (float*)_mm_malloc(sizeof(float)*n*m, 32);
float *xpt = (float*)_mm_malloc(sizeof(float)*n*m, 32);
cv::Mat1f a(n, m, apt);
cv::Mat1f b(n, m, bpt);
cv::Mat1f x(n, m, xpt);
x = a + b;

I see three problems that concern me in the code above:

1. Matrix operations on cv::Mat return a new cv::Mat object which, unfortunately, would not be aligned.
2. Am I reinventing the wheel? I don't know whether OpenCV already includes this kind of optimization; I didn't find anything useful on Google.
3. From what I've read about Intel intrinsics, memory obtained with _mm_malloc should be released with _mm_free. However, cv::Mat objects are "kind of" smart pointers, so they're allocated and de-allocated under the hood, where probably some free or delete function is called.
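On the third point: a cv::Mat constructed around user-provided data does not take ownership of it and does not free it, so the buffer has to outlive the matrix and be released with the function that matches its allocator. One way to keep that pairing safe is a `std::unique_ptr` with a custom deleter. The sketch below uses `std::aligned_alloc` and `std::free` (C++17) as a portable stand-in for `_mm_malloc` and `_mm_free`; `AlignedFree`, `AlignedFloats` and `make_aligned_floats` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Deleter that releases memory obtained from std::aligned_alloc.
// With _mm_malloc, this would call _mm_free instead.
struct AlignedFree {
    void operator()(void* p) const noexcept { std::free(p); }
};

using AlignedFloats = std::unique_ptr<float[], AlignedFree>;

// Hypothetical helper: allocates `count` floats on an `alignment`-byte boundary.
inline AlignedFloats make_aligned_floats(std::size_t count, std::size_t alignment) {
    // std::aligned_alloc expects the size to be a multiple of the alignment.
    std::size_t bytes = count * sizeof(float);
    bytes = (bytes + alignment - 1) / alignment * alignment;
    if (bytes == 0) bytes = alignment;  // avoid a zero-sized request
    void* raw = std::aligned_alloc(alignment, bytes);
    if (raw == nullptr) throw std::bad_alloc{};
    return AlignedFloats(static_cast<float*>(raw));
}
```

The owning pointer would then be declared before the cv::Mat1f that wraps `buf.get()`, so that the matrix is destroyed first and the buffer is freed afterwards.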

To overcome problem 1, in very critical sections of code where data alignment could improve the performance, I could get the pointer back and do the sum operation using pointers. Something like:

// do the code above
float *aptr = a.ptr<float>(0);
float *bptr = b.ptr<float>(0);
float *xptr = x.ptr<float>(0);
for(int i=0; i<n; i++)
  for(int j=0; j<m; j++)
    xptr[i*m+j] = aptr[i*m+j] + bptr[i*m+j];

But again, I'm afraid that I'm reinventing the wheel. Notice that this is a toy example to keep it an MCVE; my actual code is much more complicated than this, and compiler optimizations may not be obvious.
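The flat index `i*m+j` only works when rows are stored back to back. If each row should start on an aligned boundary and the column count is not a multiple of the vector width, rows need padding, and the loop has to step by a row stride instead of by `m`. The following is a minimal sketch in plain C++; `padded_stride` and `add_rows` are hypothetical helper names, and the stride is counted in floats (cv::Mat stores its `step` in bytes, so it would be divided by `sizeof(float)`).

```cpp
#include <cstddef>

// Smallest stride (in floats) >= cols that keeps every row start on an
// `alignment`-byte boundary, assuming the base pointer itself is aligned.
inline std::size_t padded_stride(std::size_t cols, std::size_t alignment) {
    const std::size_t per_block = alignment / sizeof(float);
    return (cols + per_block - 1) / per_block * per_block;
}

// Adds two row-major float images element by element.
// `stride` is the distance between consecutive rows in floats; stride >= cols.
inline void add_rows(const float* a, const float* b, float* x,
                     std::size_t rows, std::size_t cols, std::size_t stride) {
    for (std::size_t i = 0; i < rows; ++i) {
        const float* arow = a + i * stride;
        const float* brow = b + i * stride;
        float* xrow = x + i * stride;
        for (std::size_t j = 0; j < cols; ++j)
            xrow[j] = arow[j] + brow[j];  // padding elements are never touched
    }
}
```

For example, 5 columns with 32-byte alignment give a stride of 8 floats, so three floats of padding follow each row.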

I compile my code with icpc and the following flags:

INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl

Looking at [the code](https://github.com/opencv/opencv/blob/26be2402a3ad6c9eacf7ba7ab2dfb111206fffbb/modules/core/include/opencv2/core/private.hpp#L245), it would seem that it should already allocate correctly aligned blocks of memory... Considering there are AVX implementations of some of the algorithms, you'd kinda expect that to be so (IMHO). — Dan Mašek, May 01 '17 at 19:22
@DanMašek thanks for your comment, I was suspecting that this was very basic! Hmm, then I wonder why the Intel compiler tells me that the data is unaligned. In addition, if I'm working on an AVX-512 compatible machine (e.g. Knights Landing), we would need something like `#define CV_MALLOC_ALIGN 64`, which is not implemented in OpenCV. Do you agree? — cplusplusuberalles, May 01 '17 at 19:27
OpenCV has SSE support, but I don't know how deeply integrated it is... Maybe have a look at the code. Mat has a member "step" which tells you the number of bytes per Mat row, which might differ from depth * channels * cols if the memory is already aligned (and cols doesn't fit the alignment). Maybe you can allocate aligned memory with MatAllocator http://docs.opencv.org/ref/master/df/d4c/classcv_1_1MatAllocator.html#a279cc239b836c7f06e60de9d00f8f73f&gsc.tab=0 but I never used it... — Micka, May 01 '17 at 19:32
IMHO a useful vectorized algorithm should be able to cope with some misalignment (process the beginning without SIMD, then do the aligned part vectorized, and finally handle the trailing bit without SIMD). In practical use you will often need to process a smaller ROI, or have rows not a power of 2, etc. — Dan Mašek, May 01 '17 at 19:38
@DanMašek I'm optimizing [this](https://github.com/perdoch/hesaff) feature detector, so I process the whole matrices most of the time. Most of the time rows and cols are not powers of 2, but should I care? If you are referring to loop peeling, partially aligned data would still be better than nothing. — cplusplusuberalles, May 01 '17 at 19:43
@Micka I use only 1-channel matrices. But btw I think that the code above should do no harm, am I correct? I'd just like to give it a try and be 100% sure that the matrix is aligned (even 64-byte aligned for AVX-512 machines) — cplusplusuberalles, May 01 '17 at 19:47
If you provide external data memory to a cv::Mat you should (have to) provide the step variable in the constructor. Otherwise OpenCV might use wrong pixel positions because it assumes them at the wrong memory positions. — Micka, May 01 '17 at 19:52
@Micka I think I found out the problem. Please see [this](http://stackoverflow.com/questions/43726871/convert-cvmat-to-stdvector-without-copying) question on the topic. — cplusplusuberalles, May 01 '17 at 21:34
@DanMašek if you too would give a look at the question that I linked in my last comment I would really appreciate that. — cplusplusuberalles, May 01 '17 at 21:35
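The head/body/tail strategy suggested in the comments (scalar elements until the first aligned address, a vectorizable aligned middle part, scalar leftovers) can be sketched as follows. This is an illustration in plain C++, not OpenCV's implementation; `Split`, `split_for_alignment` and `add_peeled` are hypothetical names, and the pointer is assumed to be at least float-aligned.

```cpp
#include <cstddef>
#include <cstdint>

struct Split { std::size_t head, body, tail; };

// Splits `count` floats starting at `p` into a scalar head (up to the first
// `alignment`-byte boundary), a body that is a whole number of vector widths,
// and a scalar tail.
inline Split split_for_alignment(const float* p, std::size_t count,
                                 std::size_t alignment) {
    const std::size_t lanes = alignment / sizeof(float);
    const std::size_t misalign =
        reinterpret_cast<std::uintptr_t>(p) % alignment;  // bytes past a boundary
    std::size_t head = misalign == 0 ? 0 : (alignment - misalign) / sizeof(float);
    if (head > count) head = count;
    const std::size_t body = (count - head) / lanes * lanes;
    return {head, body, count - head - body};
}

// Element-wise sum peeled on the alignment of the output `x`.
inline void add_peeled(const float* a, const float* b, float* x,
                       std::size_t count, std::size_t alignment) {
    const Split s = split_for_alignment(x, count, alignment);
    std::size_t i = 0;
    for (; i < s.head; ++i) x[i] = a[i] + b[i];           // scalar head
    for (; i < s.head + s.body; ++i) x[i] = a[i] + b[i];  // x + head is aligned here
    for (; i < count; ++i) x[i] = a[i] + b[i];            // scalar tail
}
```

Peeling on `x` only guarantees aligned stores; loads from `a` and `b` are aligned in the middle loop only when those buffers share the same misalignment as `x`.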
