
I want to make my program faster by changing the algorithm or using pthreads, but I can't figure out how to use pthreads or which algorithm to apply. Can anyone help me out? Is there any algorithm to make matrix addition, matrix sum, finding the min and max, etc. faster? `g_height` is the number of rows of the matrix and `g_width` is the number of columns.

/**
 * Returns new matrix with scalar added to each element
 */
uint32_t* scalar_add(const uint32_t* matrix, uint32_t scalar) {

    uint32_t* result = new_matrix();

    /*
        to do

        1 0        2 1
        0 1 + 1 => 1 2

        1 2        5 6
        3 4 + 4 => 7 8
    */
    for (ssize_t y = 0; y < g_height; y++) {
        for (ssize_t x = 0; x < g_width; x++) {
            result[y * g_width + x] = matrix[y * g_width + x] + scalar;
        }
    }
    return result;
}

/**
 * Returns new matrix with scalar multiplied to each element
 */
uint32_t* scalar_mul(const uint32_t* matrix, uint32_t scalar) {

    uint32_t* result = new_matrix();

    /*
        to do

        1 0        2 0
        0 1 x 2 => 0 2

        1 2        2 4
        3 4 x 2 => 6 8
    */
    for (ssize_t y = 0; y < g_height; y++) {
        for (ssize_t x = 0; x < g_width; x++) {
            result[y * g_width + x] = matrix[y * g_width + x] * scalar;
        }
    }
    return result;
}

/**
 * Returns new matrix with elements added at the same index
 */
uint32_t* matrix_add(const uint32_t* matrix_a, const uint32_t* matrix_b) {

    uint32_t* result = new_matrix();

    /*
        to do

        1 0   0 1    1 1
        0 1 + 1 0 => 1 1

        1 2   4 4    5 6
        3 4 + 4 4 => 7 8
    */
    for (ssize_t y = 0; y < g_height; y++) {
        for (ssize_t x = 0; x < g_width; x++) {
            result[y * g_width + x] = matrix_a[y * g_width + x] + matrix_b[y * g_width + x];
        }
    }
    return result;
}

/**
 * Returns new matrix, multiplying the two matrices together
 */
uint32_t* matrix_mul(const uint32_t* matrix_a, const uint32_t* matrix_b) {

    uint32_t* result = new_matrix();

    /*
        to do

        1 2   1 0    1 2
        3 4 x 0 1 => 3 4

        1 2   5 6    19 22
        3 4 x 7 8 => 43 50
    */
    /* No need to clone matrix_a: result is a separate buffer, so the
       operands are never overwritten during the multiplication. */
    for (ssize_t y = 0; y < g_height; y++) {
        for (ssize_t x = 0; x < g_width; x++) {
            for (ssize_t i = 0; i < g_width; i++)
                result[y * g_width + x] += matrix_a[y * g_width + i] * matrix_b[i * g_width + x];
        }
    }
    return result;
}

/**
 * Returns new matrix, powering the matrix to the exponent
 */
uint32_t* matrix_pow(const uint32_t* matrix, uint32_t exponent) {

    /*
        to do

        1 2        1 0
        3 4 ^ 0 => 0 1

        1 2        1 2
        3 4 ^ 1 => 3 4

        1 2        199 290
        3 4 ^ 4 => 435 634
    */
    uint32_t* result = identity_matrix();
    for (uint32_t i = 0; i < exponent; i++) {
        uint32_t* next = matrix_mul(result, matrix);
        free(result);   /* release the previous intermediate product */
        result = next;
    }
    return result;
}

////////////////////////////////
///       COMPUTATIONS       //
////////////////////////////////

/**
 * Returns the sum of all elements
 */
uint32_t get_sum(const uint32_t* matrix) {

    /*
        to do

        1 2
        2 1 => 6

        1 1
        1 1 => 4
    */
    uint32_t sum = 0;
    for (ssize_t y = 0; y < g_height; y++) {
        for (ssize_t x = 0; x < g_width; x++) {
            sum += matrix[y * g_width + x];
        }
    }
    return sum;
}

/**
 * Returns the trace of the matrix
 */
uint32_t get_trace(const uint32_t* matrix) {

    /*
        to do

        1 0
        0 1 => 2

        2 1
        1 2 => 4
    */
    /* Only the diagonal contributes, so one loop suffices. */
    uint32_t trace = 0;
    for (ssize_t i = 0; i < g_height && i < g_width; i++) {
        trace += matrix[i * g_width + i];
    }
    return trace;
}

/**
 * Returns the smallest value in the matrix
 */
uint32_t get_minimum(const uint32_t* matrix) {

    /*
        to do

        1 2
        3 4 => 1

        4 3
        2 1 => 1
    */
    uint32_t min = matrix[0];
    for (ssize_t y = 0; y < g_height; y++) {
        for (ssize_t x = 0; x < g_width; x++) {
            if (min > matrix[y * g_width + x])
                min = matrix[y * g_width + x];
        }
    }
    return min;
}

/**
 * Returns the largest value in the matrix
 */
uint32_t get_maximum(const uint32_t* matrix) {

    /*
        to do

        1 2
        3 4 => 4

        4 3
        2 1 => 4
    */
    uint32_t max = matrix[0];
    for (ssize_t y = 0; y < g_height; y++) {
        for (ssize_t x = 0; x < g_width; x++) {
            if (max < matrix[y * g_width + x])
                max = matrix[y * g_width + x];
        }
    }
    return max;
}

/**
 * Returns the frequency of the value in the matrix
 */
uint32_t get_frequency(const uint32_t* matrix, uint32_t value) {

    /*
        to do

        1 1
        1 1 :: 1 => 4

        1 0
        0 1 :: 2 => 0
    */
    uint32_t frequency = 0;
    for (ssize_t y = 0; y < g_height; y++) {
        for (ssize_t x = 0; x < g_width; x++) {
            if (matrix[y * g_width + x] == value)
                frequency++;
        }
    }
    return frequency;
}
john
  • How large are your matrices, typically? Unless your matrix dimensions are huge, you will hardly benefit from multithreading due to its overhead. You would probably benefit from SIMD (SSE) ([this thread](http://stackoverflow.com/q/14967969/69809), for example), but I would suggest profiling to make sure you are not wasting time. Apart from that, the only algorithm that you could improve is [multiplication](http://en.wikipedia.org/wiki/Matrix_multiplication_algorithm), and even that won't make sense for smaller dimensions. What are you doing with these? – vgru May 21 '15 at 10:55
  • This `if(matrix[y * g_width + x]!=0){ result[y * g_width + x]=matrix[y * g_width + x]+scalar; } else{ result[y * g_width + x]=scalar; }` is hilarious. Do you *really* think it's faster to do a conditional jump in order to skip a single unsigned 32-bit addition? – EOF May 21 '15 at 10:59
  • The dimension is between 1 and 10000, and I am sure that both pthreads and AVX2 work – john May 21 '15 at 11:00
  • Post **ONLY** the relevant piece of code please. – barak manos May 21 '15 at 12:14
  • Consider disclosing what the program you are trying to speed up is supposed to do. _If_ you can introduce a matrix type (_and_ your matrices are big enough), consider accumulating scalar offset & scaling, evaluating only when bookkeeping promises to get unwieldy. – greybeard May 21 '15 at 16:18

1 Answer


Three different directions:

  • In practice, your best bet for speeding up the multiplication is to use a library like MKL or ATLAS.

  • You can find in this Wikipedia section how matrix multiplication decomposes naturally into blocks, which can be parallelized. Also see Codor's excellent point below about more cache-friendly memory layouts. I would not advise trying this in practice until you have a thorough understanding of why the previous option did not work for you.

  • Matrix multiplication is composed of many scalar products. It is often more efficient to parallelize not by blocks but by these small products. Again, your safest bet is to let a library (as in the first point) do this for you; it is very tricky to actually get a speedup from these things.

Ami Tavory
  • Thanks. But I am supposed to make use of pthreads; any hints on using them in this program? I know how to use them in a very simple case, but I am not familiar with pthreads, so I don't know how to apply them here – john May 21 '15 at 11:05
  • @Ami Tavory From a brief look at the link you gave, it seems like a description of a parallelized version of Strassen's algorithm, which can speed up matrix multiplication a lot. However, a cache-friendly memory layout for the matrices is advisable here, for instance [Morton](http://en.wikipedia.org/wiki/Z-order_curve) order, and the implementation is a bit involved. – Codor May 21 '15 at 12:05
  • @Codor That is a good point, but do you mean from a practical point of view? I don't think it's really feasible to even approximate the cache considerations that Intel's engineers do with MKL. Notwithstanding, will add your point. – Ami Tavory May 21 '15 at 12:08
  • Thanks for the comment - if you mean [MKL](https://software.intel.com/en-us/intel-mkl), I wasn't even aware of that. I doubt that a self-made implementation would outperform a professional library; I was speaking more generally. I would expect a naive implementation of Strassen's algorithm to yield no advantage over the method derived from the definition for relatively small matrices. – Codor May 21 '15 at 12:24
  • So I want to apply Strassen's algorithm to my matrix_mul part, but I don't know how to. – john May 21 '15 at 13:42