
I need to compute a vector-matrix product as efficiently as possible. Specifically, given a vector s and a matrix A, I need to compute s * A. I have a class Vector which wraps a std::vector, and a class Matrix which also wraps a std::vector (for efficiency).
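
For reference, the two classes look roughly like this (a simplified sketch, not the exact code; what matters below is the row-major storage and the member names):

#include <vector>

template<typename T>
class Matrix
{
public:
    // row-major storage: element (i, j) sits at index i*numColumns + j
    T getElementAt(unsigned int row, unsigned int column)
    {
        return matrix[row*numColumns + column];
    }
    unsigned int rows() const    { return numRows; }
    unsigned int columns() const { return numColumns; }
    // ...
private:
    unsigned int numRows;
    unsigned int numColumns;
    std::vector<T> matrix;   // all the elements, stored row by row
};

template<typename T>
class Vector
{
public:
    Vector(unsigned int initialSize = 0) : vector(initialSize) { }
    T& operator[](unsigned int i)     { return vector[i]; }
    unsigned int size() const         { return vector.size(); }
    void resize(unsigned int newSize) { vector.resize(newSize); }

    // timesMatrix is a member of Vector, so the code below accesses the
    // wrapped std::vector (called "vector") directly
    Vector<T> timesMatrix(Matrix<T>& matrix);
    // ...
private:
    std::vector<T> vector;
};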

The naive approach (the one that I am using at the moment) is to have something like

Vector<T> timesMatrix(Matrix<T>& matrix)
{
    Vector<T> result(matrix.columns());
    // constructor that does a resize on the underlying std::vector

    for(unsigned int i = 0 ; i < vector.size() ; ++i)
    {
        for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
        {
            result[j] += (vector[i] * matrix.getElementAt(i, j));
            // getElementAt accesses the appropriate entry
            // of the underlying std::vector
        }
    }
    return result;
}

It works fine and takes nearly 12000 microseconds. Note that the vector s has 499 elements, while A is 499 x 15500.

The next step was trying to parallelize the computation: if I have N threads, I can give each thread a part of the vector s and the "corresponding" rows of the matrix A. Each thread then computes a partial Vector with as many entries as A has columns (15500 here), and the final result is the entry-wise sum of these partial vectors.

First of all, in the class Matrix I added a method to extract some rows from a Matrix and build a smaller one:

Matrix<T> extractSomeRows(unsigned int start, unsigned int end)
{
    unsigned int rowsToExtract = end - start + 1;
    std::vector<T> tmp;
    tmp.reserve(rowsToExtract * numColumns);
    for(unsigned int i = start * numColumns ; i < (end+1) * numColumns ; ++i)
    {
        tmp.push_back(matrix[i]);
    }
    return Matrix<T>(rowsToExtract, numColumns, tmp);
}

Then I defined a thread routine

void timesMatrixThreadRoutine
    (Matrix<T>& matrix, unsigned int start, unsigned int end, Vector<T>& newRow)
{
    // newRow is supposed to contain the partial result
    // computed by a thread
    newRow.resize(matrix.columns());
    for(unsigned int i = start ; i < end + 1 ; ++i)
    {
        for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
        {
            newRow[j] += vector[i] * matrix.getElementAt(i - start, j);
        }
    }
}

And finally I modified the code of the timesMatrix method that I showed above:

Vector<T> timesMatrix(Matrix<T>& matrix)
{
    static const unsigned int NUM_THREADS = 4;
    unsigned int matRows = matrix.rows();
    unsigned int matColumns = matrix.columns();
    unsigned int rowsEachThread = vector.size()/NUM_THREADS;

    std::thread threads[NUM_THREADS];
    Vector<T> tmp[NUM_THREADS];

    unsigned int start, end;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        start = i*rowsEachThread;
        end = (i+1)*rowsEachThread - 1;

        threads[i] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
            matrix.extractSomeRows(start, end), start, end, std::ref(tmp[i]));
    }

    // last thread
    start = (NUM_THREADS-1)*rowsEachThread;
    end = matRows - 1;
    threads[NUM_THREADS - 1] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
        matrix.extractSomeRows(start, end), start, end, std::ref(tmp[NUM_THREADS-1]));

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }

    Vector<T> result(matColumns);
    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        result = result + tmp[i];    // the operator+ is overloaded
    }

    return result;
}

It still works but now it takes nearly 30000 microseconds, which is almost three times as much as before.

Am I doing something wrong? Do you think there is a better approach?


EDIT - using a "lightweight" VirtualMatrix

Following Ilya Ovodov's suggestion, I defined a class VirtualMatrix that wraps a T* matrixData, which is initialized in the constructor as

VirtualMatrix(Matrix<T>& m)
{
    numRows = m.rows();
    numColumns = m.columns();
    matrixData = m.pointerToData();
    // pointerToData() returns underlyingVector.data();
}

Then there is a method to retrieve a specific entry of the matrix:

inline T getElementAt(unsigned int row, unsigned int column)
{
    return *(matrixData + row*numColumns + column);
}
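
Putting the pieces together, the whole class is essentially this (a minimal sketch; the rows() and columns() accessors are assumed to simply return the stored sizes):

template<typename T>
class VirtualMatrix
{
public:
    VirtualMatrix(Matrix<T>& m)
    {
        numRows = m.rows();
        numColumns = m.columns();
        matrixData = m.pointerToData();
        // pointerToData() returns underlyingVector.data(), so nothing is copied
    }

    inline T getElementAt(unsigned int row, unsigned int column)
    {
        return *(matrixData + row*numColumns + column);
    }

    unsigned int rows() const    { return numRows; }
    unsigned int columns() const { return numColumns; }

private:
    unsigned int numRows;
    unsigned int numColumns;
    T* matrixData;   // non-owning pointer into the original Matrix
};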

Now the execution time is better (approximately 8000 microseconds) but maybe there are some improvements to be made. In particular the thread routine is now

void timesMatrixThreadRoutine
    (VirtualMatrix<T>& matrix, unsigned int startRow, unsigned int endRow, Vector<T>& newRow)
{
    unsigned int matColumns = matrix.columns();
    newRow.resize(matColumns);
    for(unsigned int i = startRow ; i < endRow + 1 ; ++i)
    {
        for(unsigned int j = 0 ; j < matColumns ; ++j)
        {
            newRow[j] += (vector[i] * matrix.getElementAt(i, j));
        }
    }
}

and the really slow part is the one with the nested for loops. If I remove it, the result is obviously wrong, but it is "computed" in less than 500 microseconds. This is to say that passing the arguments now takes almost no time and the heavy part is really the computation.

In your opinion, is there any way to make it even faster?

minomic

2 Answers

Actually you make a partial copy of the matrix for each thread in extractSomeRows. This takes a lot of time. Redesign it so that the "some rows" become a virtual matrix pointing at data located in the original matrix.
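
A possible sketch of such a view (illustrative only, assuming Matrix exposes something like pointerToData() returning the underlying std::vector's data(); the view stores that pointer plus a starting row, so constructing one copies nothing):

// Non-owning view over a block of consecutive rows of an existing Matrix.
template<typename T>
class VirtualMatrix
{
public:
    VirtualMatrix(Matrix<T>& m, unsigned int startRow, unsigned int rowCount)
        : matrixData(m.pointerToData()),   // the underlying std::vector's data()
          numColumns(m.columns()),
          startRow(startRow),
          numRows(rowCount)
    { }

    T getElementAt(unsigned int row, unsigned int column) const
    {
        // row is relative to the view, so shift it by startRow
        return *(matrixData + (startRow + row)*numColumns + column);
    }

    unsigned int rows() const    { return numRows; }
    unsigned int columns() const { return numColumns; }

private:
    T* matrixData;
    unsigned int numColumns;
    unsigned int startRow;
    unsigned int numRows;
};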

Ilya Ovodov
  • Moreover it seems that you copy matrix data twice (1st time into std::vector tmp then into matrix) – Ilya Ovodov Feb 24 '16 at 14:35
  • Thanks for the suggestion: that would be interesting to try. Now my question is "how can I implement a virtual matrix?" – minomic Feb 24 '16 at 14:46
  • If Matrix stores its data in a vector, then the implementation of matrix.getElementAt(i, j) is something like *(vector.data() + rowcount*i + j). – Ilya Ovodov Feb 24 '16 at 15:01
  • Right now I have `return matrix[row*numColumns + column]` where `matrix` is the name of the `std::vector` that stores all the elements. I can try with `data()` to see if it's faster – minomic Feb 24 '16 at 15:06
  • If Matrix stores its data in vector, then implementation of matrix.getElementAt(i, j) is somthing like return *(vector.data() + column_count*i + j). Create class VirtualMatrix, initialize it with reference to main matrix vector.data(). Then its getElementAt(i, j) must be implemented as *(vector.data() + column_count*(i + start_row) + j). – Ilya Ovodov Feb 24 '16 at 15:11

Use the architecture's vectorized instructions by making it more explicit that you want to multiply in blocks of 4, i.e. SSE2+ on x86-64 and possibly NEON on ARM.

C++ compilers can often unroll the loop into vectorized code if you explicitly make the operation happen on contiguous elements:

Simple and fast matrix-vector multiplication in C / C++
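
For instance, the inner loop over one row can be written along these lines (a sketch only, using the row-major layout and raw pointer access from the question; matrixData, s and result stand for the corresponding data there, and with -O3 / -march=native the compiler can map the 4-wide blocks to SSE2 or NEON instructions):

const T* row = matrixData + i*numColumns;   // start of row i
T si = s[i];                                // the vector element for this row
unsigned int j = 0;
for( ; j + 4 <= numColumns ; j += 4)        // main part, 4 elements at a time
{
    result[j]     += si * row[j];
    result[j + 1] += si * row[j + 1];
    result[j + 2] += si * row[j + 2];
    result[j + 3] += si * row[j + 3];
}
for( ; j < numColumns ; ++j)                // leftover tail of the row
{
    result[j] += si * row[j];
}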

There is also the option of using libraries specifically made for matrix multiplication. For larger matrices, it may be more efficient to use special implementations based on the Fast Fourier Transform, alternative algorithms like Strassen's Algorithm, etc. In fact, your best bet would be to use a C library like this, and then wrap it in an interface that looks similar to a C++ vector.
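
For example, with a CBLAS implementation such as OpenBLAS the whole product becomes a single gemv call (a sketch, double precision assumed; use cblas_sgemv for float and link against the BLAS library):

#include <cblas.h>
#include <vector>

// result = s * A, with s of length `rows` and A stored row-major as rows x cols
std::vector<double> timesMatrixBlas(const std::vector<double>& s,
                                    const std::vector<double>& A,
                                    int rows, int cols)
{
    std::vector<double> result(cols, 0.0);
    // computes result = 1.0 * A^T * s + 0.0 * result
    cblas_dgemv(CblasRowMajor, CblasTrans,
                rows, cols,
                1.0, A.data(), cols,
                s.data(), 1,
                0.0, result.data(), 1);
    return result;
}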

CinchBlue
  • I tried unrolling four operations by hand, but the execution time did not change. I admit I know pretty much nothing about FFT techniques, so I will try to read up a bit and see what I can find. – minomic Feb 26 '16 at 08:39