
I started a similar question in another thread, but there I was focusing on how to use OpenCV. Having failed to achieve what I originally wanted, I will ask here for exactly what I want.

I have two matrices. Matrix a is 2782x128 and matrix b is 4000x128, both holding unsigned char values stored in a single contiguous array. For each row vector in a, I need the index of the row in b at the smallest Euclidean distance.

Ok, now my code to achieve this:

#include <windows.h>
#include <cstdlib>
#include <cstdio>
#include <climits>   // INT_MAX
#include <math.h>
#include <time.h>
#include <sys/timeb.h>
#include <iostream>
#include <fstream>
#include "main.h"

using namespace std;

int main(int argc, char* argv[])
{
    int a_size;
    unsigned char* a = NULL;
    read_matrix(&a, a_size,"matrixa");
    int b_size;
    unsigned char* b = NULL;
    read_matrix(&b, b_size,"matrixb");

    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );

    int* indexes = NULL;
    min_distance_loop(&indexes, b, b_size, a, a_size);

    QueryPerformanceCounter( &liEnd );

    cout << "loop time: " << (liEnd.QuadPart - liStart.QuadPart) / (long double)liPerfFreq.QuadPart << "s." << endl;

    if (a)
        delete[] a;
    if (b)
        delete[] b;
    if (indexes)
        free(indexes);   // allocated with malloc in min_distance_loop
    return 0;
}

void read_matrix(unsigned char** matrix, int& matrix_size, const char* matrixPath)
{
    FILE* pFile = fopen(matrixPath, "r");
    if (!pFile) { matrix_size = 0; *matrix = NULL; return; }
    fscanf(pFile, "%d", &matrix_size);
    *matrix = new unsigned char[matrix_size*128];

    for (int i=0; i<matrix_size*128; ++i)
    {
        unsigned int matPtr;
        fscanf (pFile, "%u", &matPtr);
        (*matrix)[i] = (unsigned char)matPtr;
    }
    fclose (pFile);
}

void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;

    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex=0;
    int vocIndex=0;
    int min_distance;
    int distance;
    int multiply;

    unsigned char* dataPtr;
    unsigned char* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = INT_MAX;
        for (int j=0; j<b_size; ++j)
        {
            distance=0;
            dataPtr = &a[dataIndex];
            vocPtr = &b[vocIndex];

            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++-*vocPtr++;
                distance += multiply*multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance>min_distance)
                    break;
            }

            // if distance smaller
            if (distance<min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex+=descrSize;
        }
        dataIndex+=descrSize;
        vocIndex=0;
    }
}

Attached are files with sample matrices:

matrixa matrixb

I am using windows.h only to measure the elapsed time, so if you want to test the code on a platform other than Windows, just replace the windows.h header and change the way the elapsed time is computed.

This code takes about 0.5 seconds on my computer. The problem is that I have Matlab code that does the same thing in 0.05 seconds. In my experiments, I receive several matrices like matrix a every second, so 0.5 seconds is too much.

Now the Matlab code that calculates this:

aa=sum(a.*a,2); bb=sum(b.*b,2); ab=a*b'; 
d = sqrt(abs(repmat(aa,[1 size(bb,1)]) + repmat(bb',[size(aa,1) 1]) - 2*ab));
[minz index]=min(d,[],2);

OK. The Matlab code is using the identity (x-a)^2 = x^2 + a^2 - 2ax.

So my next attempt was to do the same thing in C++. I deleted my own code and made the same calculations, but it took approximately 1.2 seconds.

Then, I tried to use different external libraries. The first attempt was Eigen:

const int descrSize = 128;
MatrixXi a(nframes, descrSize);
MatrixXi b(vocabulary_size, descrSize);
MatrixXi ab(nframes, vocabulary_size);

unsigned char* dataPtr = matrixa;
for (int i=0; i<nframes; ++i)
{
    for (int j=0; j<descrSize; ++j)
    {
        a(i,j)=(int)*dataPtr++;
    }
}
unsigned char* vocPtr = matrixb;
for (int i=0; i<vocabulary_size; ++i)
{
    for (int j=0; j<descrSize; ++j)
    {
        b(i,j)=(int)*vocPtr ++;
    }
}
ab = a*b.transpose();
MatrixXi aa = a.cwiseProduct(a).rowwise().sum();  // squared norms of the rows of a
MatrixXi bb = b.cwiseProduct(b).rowwise().sum();  // squared norms of the rows of b
MatrixXi d = (aa.replicate(1,vocabulary_size) + bb.transpose().replicate(nframes,1) - 2*ab).cwiseAbs();

int* index = NULL;
index = (int*)malloc(nframes*sizeof(int));
for (int i=0; i<nframes; ++i)
{
    d.row(i).minCoeff(&index[i]);
}

This Eigen code costs approximately 1.2 seconds for just the line that says ab = a*b.transpose();

A similar code using OpenCV was also tried, and the cost of ab = a*b.transpose(); was 0.65 seconds there.

So, it is really annoying that Matlab can do this same thing so quickly and I cannot in C++! Of course being able to run my experiment would be great, but I think the lack of knowledge is what really annoys me. How can I achieve at least the same performance as in Matlab? Any kind of solution is welcome: any external library (free if possible), loop unrolling, templates, SSE instructions (I know they exist), cache tricks. As I said, my main purpose is to increase my knowledge so that I can code things like this with faster performance.

Thanks in advance

EDIT: more code, as suggested by David Hammen. I cast the arrays to int before doing any calculations. Here is the code:

void min_distance_loop(int** indexes, unsigned char* b, int b_size, unsigned char* a, int a_size)
{
    const int descrSize = 128;

    int* a_int;
    int* b_int;

    LARGE_INTEGER liStart;
    LARGE_INTEGER liEnd;
    LARGE_INTEGER liPerfFreq;
    QueryPerformanceFrequency( &liPerfFreq );
    QueryPerformanceCounter( &liStart );

    a_int = (int*)malloc(a_size*descrSize*sizeof(int));
    b_int = (int*)malloc(b_size*descrSize*sizeof(int));

    for(int i=0; i<descrSize*a_size; ++i)
        a_int[i]=(int)a[i];
    for(int i=0; i<descrSize*b_size; ++i)
        b_int[i]=(int)b[i];

    QueryPerformanceCounter( &liEnd );

    cout << "Casting time: " << (liEnd.QuadPart - liStart.QuadPart) / (long double)liPerfFreq.QuadPart << "s." << endl;

    *indexes = (int*)malloc(a_size*sizeof(int));
    int dataIndex=0;
    int vocIndex=0;
    int min_distance;
    int distance;
    int multiply;

    int* dataPtr;
    int* vocPtr;
    for (int i=0; i<a_size; ++i)
    {
        min_distance = INT_MAX;
        for (int j=0; j<b_size; ++j)
        {
            distance=0;
            dataPtr = &a_int[dataIndex];
            vocPtr = &b_int[vocIndex];

            for (int k=0; k<descrSize; ++k)
            {
                multiply = *dataPtr++-*vocPtr++;
                distance += multiply*multiply;
                // If the distance is greater than the previously calculated, exit
                if (distance>min_distance)
                    break;
            }

            // if distance smaller
            if (distance<min_distance)
            {
                min_distance = distance;
                (*indexes)[i] = j;
            }
            vocIndex+=descrSize;
        }
        dataIndex+=descrSize;
        vocIndex=0;
    }
}

The entire process now takes 0.6 seconds, and the casting loops at the beginning take 0.001 seconds. Maybe I did something wrong?

EDIT2: Anything about Eigen? When I look for external libraries, everyone talks about Eigen and its speed. Did I do something wrong? Here is a simple piece of code using Eigen that shows it is not so fast. Maybe I am missing some config or some flag, or ...

MatrixXd A = MatrixXd::Random(1000, 1000);
MatrixXd B = MatrixXd::Random(1000, 500);
MatrixXd X = A*B;

This code is about 0.9 seconds.

min.yong.yoon
  • You compiled all tests in release mode? – Denis Ermolin Sep 26 '12 at 08:30
  • You may find it annoying that Matlab outperforms your code, but I, who make heavy use of Matlab, find it very satisfying. I don't have much in the way of concrete advice for you but the key to improving the performance of this type of code is often to make optimal (or at least very good) use of the memory hierarchy on modern CPUs. Another factor to consider is that much of Matlab's core functionality is now multi-threaded for execution on multi-core CPUs, it's not clear to me that any of your own code is multi-threaded; that would probably have some impact on performance. – High Performance Mark Sep 26 '12 at 08:38
  • I don't know how to help you make your C/C++ code faster (your code looks more like C than C++. Evidence: `malloc`) yet, but I can see how you can make your Matlab code faster: Eliminate the `sqrt`. Given two non-negative numbers a and b, sqrt(a)>sqrt(b) ⇔ a>b. – David Hammen Sep 26 '12 at 09:04
  • 1
    Denis Ermolin, yes, In Debug mode is about 2.5 seconds. High Performance Mark, you are right, when working with Matlab is satisfying, but now I have to make a real implementation of the matlab prototyped code. David Hammen, I know that. If you see the C++ code, I avoided the sqrt. I also tried to avoid the multiply*multiply by using distance+=abs(multiply). Result? Worse. About 0.8 seconds. – min.yong.yoon Sep 26 '12 at 09:53
  • Yes, the malloc is a remnant from the past lol, sorry. But it is ok to the performance right? – min.yong.yoon Sep 26 '12 at 10:16
  • @min.yong.yoon - Even if replacing `distance += multiply*multiply` with ` distance+=abs(multiply)` had been faster, that is something you probably shouldn't want to do. What you are doing with this change is to change the metric from the Euclidean norm to the taxicab norm. If you want Euclidean norm, that's what you need to calculate. – David Hammen Sep 26 '12 at 10:26
  • Yes. Just trying different things... – min.yong.yoon Sep 26 '12 at 11:09
  • Most likely, matlab is using a fast matrix multiplication algorithm, such as the Strassen algorithm, which can be a significant benefit for large matrices. – Vaughn Cato Sep 26 '12 at 15:10
  • 1
    Here's a related question: http://stackoverflow.com/questions/6058139/why-is-matlab-so-fast-in-matrix-multiplication – Vaughn Cato Sep 26 '12 at 15:12
  • Thanks, I saw that link, but there is no optimization clue for C++. – min.yong.yoon Sep 26 '12 at 15:34
  • I do think you have a problem using Eigen. I tried a little benchmark I post the answer in [your related question](http://stackoverflow.com/questions/12479663/cvmat-cv-8u-product-error-and-slow-cv-32f-product) – remi Sep 29 '12 at 21:16
  • I benchmarked your Eigen code (up to the line MatrixXd d = ...): it takes 0.62s using an int matrix, 0.81s using a double matrix (MatrixXd) and 0.59s using MatrixXf. Probably still slower than Matlab, but Eigen is single-threaded here. – remi Sep 29 '12 at 22:10

3 Answers


As you observed, your code is dominated by the matrix product, which represents about 2.8e9 arithmetic operations. You say that Matlab (or rather the highly optimized MKL) computes it in about 0.05s. That is a rate of 57 GFLOPS, showing that it is not only using vectorization but also multi-threading. With Eigen, you can enable multi-threading by compiling with OpenMP enabled (-fopenmp with gcc). On my 5-year-old computer (2.66GHz Core2), using floats and 4 threads, your product takes about 0.053s, and 0.16s without OpenMP, so there must be something wrong with your compilation flags. To summarize, to get the best of Eigen:

  • compile in 64-bit mode
  • use floats (doubles are twice as slow owing to vectorization)
  • enable OpenMP
  • if your CPU has hyper-threading, then either disable it or set the OMP_NUM_THREADS environment variable to the number of physical cores (this is very important, otherwise the performance will be very bad!)
  • if you have other tasks running, it might be a good idea to reduce OMP_NUM_THREADS to nb_cores-1
  • use the most recent compiler you can; GCC, clang and ICC are best, MSVC is usually slower.
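As a concrete illustration, the checklist above might translate into a GCC invocation along these lines (the Eigen include path, binary name and thread count are placeholders for your setup):

```shell
# Hypothetical build reflecting the checklist above:
# optimized, 64-bit, SSE2 vectorization, OpenMP, asserts disabled.
g++ -O3 -m64 -msse2 -fopenmp -DNDEBUG -I/path/to/eigen main.cpp -o bench

# Run with one thread per *physical* core (not per hyper-thread).
OMP_NUM_THREADS=4 ./bench
```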
ggael

One thing that is definitely hurting you in your C++ code is that it has a boatload of char to int conversions. By boatload, I mean up to 2*2782*4000*128 char to int conversions. Those char to int conversions are slow, very slow.

You can reduce this to (2782+4000)*128 such conversions by allocating a pair of int arrays, one 2782*128 and the other 4000*128, to contain the cast-to-integer contents of your char* a and char* b arrays. Work with these int* arrays rather than your char* arrays.

Another problem might be your use of int versus long. I don't work on Windows, so this might not be applicable. On the machines I work on, int is 32 bits and long is 64 bits. 32 bits is more than enough, because 255*255*128 < 256*256*128 = 2^23.

That obviously isn't the problem.

What's striking is that the code in question is not calculating that huge 2782 by 4000 array that the Matlab code is creating. What's even more striking is that Matlab is most likely doing this with doubles rather than ints -- and it's still beating the pants off the C/C++ code.

One big problem is cache. That 4000*128 array is far too big for level 1 cache, and you are iterating over that big array 2782 times. Your code is doing far too much waiting on memory. To overcome this problem, work with smaller chunks of the b array so that your code works with level 1 cache for as long as possible.

Another problem is the optimization if (distance>min_distance) break;. I suspect that this is actually a dis-optimization. Having if tests inside your innermost loop is oftentimes a bad idea. Blast through that inner product as fast as possible. Other than wasted computations, there is no harm in getting rid of this test. Sometimes it is better to make apparently unneeded computations if doing so can remove a branch in an innermost loop. This is one of those cases. You might be able to solve your problem just by eliminating this test. Try doing that.

Getting back to the cache problem, you need to get rid of this branch so that you can split the operations over the a and b matrices into smaller chunks, chunks of no more than 256 rows at a time. That's how many rows of 128 unsigned chars fit into the 32KB L1 data cache of a modern Intel chip. Since 250 divides 4000, look into logically splitting that b matrix into 16 chunks. You may well want to form that big 2782 by 4000 array of inner products, but do so in small chunks. You can add that if (distance>min_distance) break; back in, but do so at the chunk level rather than at the byte-by-byte level.

You should be able to beat Matlab because it almost certainly is working with doubles, but you can work with unsigned chars and ints.

David Hammen
  • Thanks David, but the casting thing is about 0.001 seconds. I tried also to put the a and b arrays in int* arrays, and the performance was worse. – min.yong.yoon Sep 26 '12 at 09:28
  • You are casting no matter what, and those casts are slow. This line, `multiply = *dataPtr++-*vocPtr++;` involves two casts from unsigned char to int. The results of `*dataPtr` and `*vocPtr` are cast to integers, then subtracted. You did something wrong if you did the casts ahead of time and got worse performance. – David Hammen Sep 26 '12 at 09:35
  • OK, I added the prior casting into int arrays. Edit: I cannot add a new answer, so I am editing the original post. It would be too large otherwise, I think ... – min.yong.yoon Sep 26 '12 at 09:47
  • Yes, I used also int. The new code is using int. The performance is the same. My intuition says it is a problem of caching, I am doing too much memory accesses. I should avoid some memory acesses, but I am not able to do that. – min.yong.yoon Sep 26 '12 at 10:10
  • `char` to `int` conversions are not "very slow". It's perhaps one extra instruction, but the better data density in cache more than compensates for the conversion by requiring fewer memory accesses. – Ben Voigt Sep 26 '12 at 14:14
  • @BenVoigt yes, I also made a loop at the beginning of the function to make the arrays to be of type int. As I commented before, it is consuming more time, about 0.1 second more, while the casting loops are 0.001 seconds, so this is not being helpful... – min.yong.yoon Sep 26 '12 at 14:54
  • @min.yong.yoon - The `unsigned char` to `int` conversion obviously is not the problem. I updated my answer. – David Hammen Sep 26 '12 at 17:20
  • @DavidHammen - Thanks man. First I disabled the if (distance>min_distance) break; Then the consuming time is 1.1 seconds. Ok, Then I make the loop that says for (int j=0; j – min.yong.yoon Sep 27 '12 at 07:45

Matrix multiply generally uses the worst possible cache access pattern for one of the two matrices, and the solution is to transpose one of the matrices and use a specialized multiply algorithm that works on data stored that way.

Your matrix already IS stored transposed. By transposing it into the normal order and then using a normal matrix multiply, you are absolutely killing performance.

Write your own matrix multiply loop that inverts the order of indices to the second matrix (which has the effect of transposing it, without actually moving anything around and breaking cache behavior). And pass your compiler whatever options it has for enabling auto-vectorization.

Ben Voigt
  • I think I am not understanding your point. In my own implementation of mininum distance (first code), I am not transposing anything. In the case of Eigen code, I need to transpose it. – min.yong.yoon Sep 26 '12 at 14:27
  • @min.yong.yoon: Well, have you looked at the code generated by the compiler for that multiply loop? Is it using SSE2 instructions? – Ben Voigt Sep 26 '12 at 15:17
  • I tried to activate the SSE2 instructions in my Visual Studio 10 project settings, but the consuming time is the same, and the code generated for the loop is also the same – min.yong.yoon Sep 26 '12 at 15:38
  • @min.yong.yoon: I think the Microsoft compiler is rather behind the times with auto-vectorization. It supports SSE instructions, but you have to use them yourself by means of intrinsics. The very latest version, the one bundled in Visual Studio 2012, finally changes this. You might benchmark the code using G++ or Intel C++, since both of those are able to vectorize automatically. – Ben Voigt Sep 26 '12 at 17:34
  • Update. My own loop is not getting better speed with SSE/SSE2, but the Eigen way is faster. With SSE2 using int matrices, 0.74 seconds, but if I use float matrices, 0.49 seconds. That is an improvement, but it is similar to my own loop. Any more ideas on that? – min.yong.yoon Sep 27 '12 at 07:49
  • @min.yong.yoon: Most of the SIMD instructions for speeding up integer math are SSE3 and SSE4, so SSE2 might not speed it up any. – Ben Voigt Sep 27 '12 at 18:46