How to Speed up these functions?

Question

I have two functions that I am working with, and I am attempting to make them run as much as 4x faster.

void get_each_fifth(const matrix_t *matrix, long results[RESULTS_LEN]) {
    for (int i = 0; i < matrix->rows; i++) {
        for (int j = 0; j < matrix->cols; j++) {
            int q = j % RESULTS_LEN;
            results[q] += MGET(matrix, i, j);
        }
    }
}

The function above will need to be optimized to be 4x faster. In this function, I am finding the sums of integers based on their location in the matrix. Elements in column 0, 5, 10, etc. go into the first element of the results array. Elements in column 1, 6, 11, etc. go into the second column of the array. This pattern continues for the remaining columns. To summarize, the numbers in column i go into element i % 5 of the results array.

long get_each(const matrix_t *matrix) {
    long sum = 0;
    for (int i = 0; i < matrix->rows; i++) {
        for (int j = 0; j < matrix->cols; j++) {
            sum += MGET(matrix, i, j);
        }
    }
    return sum;
}

This one will need to be 2x faster; it is the sum all of the elements in the matrix and return the result.

MGET and MSET are defined:

#define MGET(mat, i, j) ((mat)->data[((i)*((mat)->cols)) + (j)])
#define MSET(mat, i, j, x) ((mat)->data[((i)*((mat)->cols)) + (j)] = (x))

and the matrix_t struct is defined

typedef struct {
    long rows;
    long cols;
    int *data;
} matrix_t;

and is allocated with this function:

void set_up_matrix(matrix_t *matrix, int rows, int cols) {
    if (matrix == NULL) {
        return;
    }
    matrix->rows = rows;
    matrix->cols = cols;
    matrix->data = malloc(sizeof(int) * rows * cols);
    srand(2021);
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            MSET(matrix, i, j, rand() % 100);
        }
    }
}

and result len is defined:

#define RESULTS_LEN 5

Any help would be appreciated!

Many optimisations depend on the matrix structure (e.g. row or column order) and the machine's memory model and instruction set, none of which are included in the question. — Pete Kirkham, Nov 29 '21 at 23:13
The modulus is either very expensive or very cheap. Can you arrange for RESULTS_LEN to be a power of two? — Passerby, Nov 29 '21 at 23:15
To remove the modulus, you can add a new loop iterating from 0 to `RESULTS_LEN`. The performance if very dependent of what `MGET` does a pointed out by @PeteKirkham. Please provide a more complete code. Besides this, did your enabled *optimization flags* of your compiler? — Jérôme Richard, Nov 29 '21 at 23:25
Probably you have double pointer in the `matrix` which is extremely inefficient. — 0___________, Nov 29 '21 at 23:37
I Have updated the question to give some more information and clear things up a bit, I apologize for being vague. hopefully that helps. — durtdee, Nov 29 '21 at 23:39
The matrix is implemented as a single linear memory block, couldn't you implement most (if not all) of those nested loops as single loops accessing the elements of the array sequentially? The compiler may miss that "optimization". — Bob__, Nov 29 '21 at 23:46
[not the solution] most people would prefer unsigned types (size_t) as sizes and indices. — wildplasser, Nov 29 '21 at 23:46
@Bob__ that would make 'every Nth column' more tricky, and unlikely to gain 4x speed improvement. — Weather Vane, Nov 29 '21 at 23:53
Perhaps he could unroll 5 times and use different starting offsets and a step size of 5 - or have 5 counters, added as a last step - many possible options. Must use profiler. — 500 - Internal Server Error, Nov 29 '21 at 23:56
I'd want to ensure that your matrix access is _cache friendly_. See https://stackoverflow.com/questions/5200338/a-cache-efficient-matrix-transpose-program Also, consider using _pointers_ to rows (e.g.): `int *rowcur = &mat->data[irow * mat->cols];` and then: `for (icol = 0; icol < mat->cols; ++icol) rowcur[icol] = something;` to eliminate some extra multiplies when calculating the indexing for the 2D array(s) — Craig Estey, Nov 29 '21 at 23:56
Why do `matrix->rows;` and `matrix->cols;` need to be `long` when the loops variables are `int i` and `int j`? On a 32-bit machine the difference may be significant. — Weather Vane, Nov 29 '21 at 23:59
What compiler are you using? Post your compiler command, too. — Andrew Henle, Nov 30 '21 at 00:18

0___________ · Answer 1 · 2021-11-30T00:06:19.643

I would change it to flexible array member and leave the arithmetic to the compiler. It will make it more cache friendly as well. You need to make its task easier by showing what kind of array your data represents. It will allow the compiler to optimize the loops or use vector instructions (if you use some additional command line options). You can use also compiler specific pragmas or attributes like in the example below. It will unroll the loops speeding up the execution speed.

typedef struct {
    size_t rows;
    size_t cols;
    int data[];
} matrix_t;

void get_each_fifth(const matrix_t * restrict matrix, long results[RESULTS_LEN]) 
{
    int (*array)[matrix -> cols] = (int (*)[matrix -> cols])matrix -> data;
    #pragma GCC unroll 10
    for (size_t i = 0; i < matrix->rows; i++) {
        #pragma GCC unroll 10
        for (size_t j = 0; j < matrix->cols; j++) {
            size_t q = j % RESULTS_LEN;
            results[q] += array[i][j];
        }
    }
}

matrix_t *set_up_matrix(matrix_t *matrix, size_t rows, size_t cols) 
{
    int (*array)[cols];
    matrix = realloc(matrix, sizeof(*matrix) + rows * cols * sizeof(matrix -> data[0]));
    if(matrix)
    {
        matrix->rows = rows;
        matrix->cols = cols;
        srand(time(NULL));  // it should be called once in main
        array = (int (*)[cols])matrix -> data;
        for (size_t i = 0; i < rows; i++) 
        {
            for (size_t j = 0; j < cols; j++) 
            {
                array[i][j] = rand() % 100;
            }
        }
    }
    return matrix;
}

I'm not sure what the purpose of `matrix` parameter in `set_up_matrix()` is. Assuming it points to valid matrix then `realloc` will keep the effective type of `matrix->data` which is `int(*)[prev_cols]` which is likely not compatible with `int(*)[cols]`. As result the code likely invokes *UB* due to violation of strict aliasing rule. — tstanisl, Nov 30 '21 at 08:09

How to Speed up these functions?

1 Answers1