
!!! HOMEWORK - ASSIGNMENT !!!

Please do not post code, as I would like to complete this myself. Instead, if possible, point me in the right direction with general information, by pointing out mistakes in my thinking, or with other useful and relevant resources.


I have a method that creates my square npages * npages matrix_hat of doubles for use in my PageRank algorithm.

I have implemented it with pthreads only, with SIMD only, and with both pthreads and SIMD. Profiling with the Xcode Instruments time profiler, I found that the pthreads-only version is the fastest, the combined pthreads and SIMD version is next, and the SIMD-only version is the slowest (see the running times below).

As it is homework it can be run on multiple different machines; however, we were given the header #include <immintrin.h>, so it is to be assumed we can use up to AVX at least. We are given the number of threads the program will use as an argument to the program, and it is stored in the global variable g_nthreads.

In my tests I have been running it on my own machine, an Ivy Bridge with 4 hardware cores and 8 logical cores, and I have tested it with both 4 threads and 8 threads as the argument.


RUNNING TIMES:

SIMD ONLY:

331ms - for the construct_matrix_hat function

PTHREADS ONLY (8 threads):

70ms - each thread concurrently

SIMD & PTHREADS (8 threads):

110ms - each thread concurrently


What am I doing that is slowing it down more when using both forms of optimisation?

I will post each implementation:


All versions share these macros:

#define BIG_CHUNK    (g_n2/g_nthreads)
#define SMALL_CHUNK  (g_npages/g_nthreads)
#define MOD          (BIG_CHUNK - (BIG_CHUNK % 4))
#define IDX(a, b)    (((a) * g_npages) + (b))

Pthreads:

// struct used for passing arguments
typedef struct {
    double* restrict m;
    double* restrict m_hat;
    int t_id;
    char padding[44];   // pad the struct to 64 bytes (one cache line) to avoid false sharing
} t_arg_matrix_hat;

// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
    t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
    // set coordinate limits thread is able to act upon
    size_t start = t_arg->t_id * BIG_CHUNK;
    size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;

    // Initialise coordinates with given uniform value
    for (size_t i = start; i < end; i++) {
        t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
    }

    return NULL;
}

// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);

    // create structs to send and retrieve matrix and value from threads
    t_arg_matrix_hat t_args[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        t_args[i] = (t_arg_matrix_hat) {
            .m = matrix,
            .m_hat = matrix_hat,
            .t_id = i
        };
    }
    // create threads and send structs with matrix and value to divide the matrix and
    // initialise the coordinates with the given value
    pthread_t threads[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
    }
// join threads after all coordinates have been initialised
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_join(threads[i], NULL);
    }

    return matrix_hat;
}

SIMD:

// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);

    double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener};
    __m256d b = _mm256_loadu_pd(dampeners);
    // Use simd to subtract values from each other
    for (size_t i = 0; i < g_mod; i += 4) {
        __m256d a = _mm256_loadu_pd(matrix + i);
        __m256d res = _mm256_mul_pd(a, b);
        _mm256_storeu_pd(&matrix_hat[i], res);
    }
    // Subtract values from each other that weren't included in simd
    for (size_t i = g_mod; i < g_n2; i++) {
        matrix_hat[i] = g_dampener * matrix[i];
    }
    double hats[4] = {HAT, HAT, HAT, HAT};
    b = _mm256_loadu_pd(hats);
    // Use simd to raise each value to the power 2
    for (size_t i = 0; i < g_mod; i += 4) {
        __m256d a = _mm256_loadu_pd(matrix_hat + i);
        __m256d res = _mm256_add_pd(a, b);
        _mm256_storeu_pd(&matrix_hat[i], res);
    }
    // Raise each value to the power 2 that wasn't included in simd
    for (size_t i = g_mod; i < g_n2; i++) {
        matrix_hat[i] += HAT;
    }

    return matrix_hat;
}

Pthreads & SIMD:

// struct used for passing arguments
typedef struct {
    double* restrict m;
    double* restrict m_hat;
    int t_id;
    char padding[44];   // pad the struct to 64 bytes (one cache line) to avoid false sharing
} t_arg_matrix_hat;

// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
    t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
    // set coordinate limits thread is able to act upon
    size_t start = t_arg->t_id * BIG_CHUNK;
    size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
    size_t leftovers = start + MOD;

    __m256d b1 = _mm256_loadu_pd(dampeners);   // dampeners[] is a file-scope array holding four copies of g_dampener (defined elsewhere)
    // Multiply this thread's chunk by the dampener, four doubles at a time
    for (size_t i = start; i < leftovers; i += 4) {
        __m256d a1 = _mm256_loadu_pd(t_arg->m + i);
        __m256d r1 = _mm256_mul_pd(a1, b1);
        _mm256_storeu_pd(&t_arg->m_hat[i], r1);
    }
    // Handle the leftover elements that weren't covered by the SIMD loop
    for (size_t i = leftovers; i < end; i++) {
        t_arg->m_hat[i] = dampeners[0] * t_arg->m[i];
    }

    __m256d b2 = _mm256_loadu_pd(hats);   // hats[] is a file-scope array holding four copies of HAT (defined elsewhere)
    // Add HAT to this thread's chunk, four doubles at a time
    for (size_t i = start; i < leftovers; i += 4) {
        __m256d a2 = _mm256_loadu_pd(t_arg->m_hat + i);
        __m256d r2 = _mm256_add_pd(a2, b2);
        _mm256_storeu_pd(&t_arg->m_hat[i], r2);
    }
    // Handle the leftover elements that weren't covered by the SIMD loop
    for (size_t i = leftovers; i < end; i++) {
        t_arg->m_hat[i] += hats[0];
    }

    return NULL;
}

// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);

    // create structs to send and retrieve matrix and value from threads
    t_arg_matrix_hat t_args[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        t_args[i] = (t_arg_matrix_hat) {
            .m = matrix,
            .m_hat = matrix_hat,
            .t_id = i
        };
    }
    // create threads and send structs with matrix and value to divide the matrix and
    // initialise the coordinates with the given value
    pthread_t threads[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
    }
// join threads after all coordinates have been initialised
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_join(threads[i], NULL);
    }

    return matrix_hat;
}
  • Please post compilable code. – EOF May 31 '16 at 09:49
  • @EOF I'm sorry I cannot post compilable code as it is a homework question. I read through Stack Overflow's policies a bit further as I was seeming to ask a lot of negative questions and didn't understand fully why. I have edited the question to reflect that it is homework and ask for only general nudges or hints and specifically ask to not receive code. Any assistance in asking a better SO question is greatly appreciated. I would like to write questions that both help me but also serve as a resource to future students or developers working on similar problems. – joshuatvernon May 31 '16 at 12:56
  • I made an edit to try to make the question more clear, by presenting critical information early, so reading the code is less of a puzzle in figuring out what it even does (which is necessary before we can figure out why it runs slow). You also haven't said what hardware you're running this on. (Sandybridge? Haswell? Skylake? Bulldozer? Silvermont?). How many threads did you use with the threaded version, vs. how many physical cores your machine has. (And hyperthreading?). How fast does your baseline scalar version run (or pthreads with one thread). – Peter Cordes May 31 '16 at 14:00
  • Please don't re-post modified versions of [the same question](http://stackoverflow.com/questions/37527974/how-to-use-pthreads-and-simd-avx-together-for-pagerank-matrix-construction) - if you want to add more information than just [edit] the original question. – Paul R May 31 '16 at 14:19
  • @PaulR It is about the same code but is not a modified version of the same question as I am asking a different question. – joshuatvernon May 31 '16 at 15:32
  • @PeterCordes As it is homework we are not aware of what machine it will be running on, however my machine has 4 hardware cores and 8 logical cores and is an Ivy Bridge. Also I take the number of threads the program should use as an argument. This is stored in the variable `g_nthreads` – joshuatvernon May 31 '16 at 15:44
  • Ok, so you don't know exactly what machine you need to tune for, but it's probably a recent Intel, too. More importantly, the question you're currently asking is about performance on your IvB with hyperthreading. Can you post the actual numbers? Just "slower" or "faster" doesn't tell us how much, and doesn't tell us how many threads you used in your test-run. Remember that the people trying to help you don't know *anything* about what you're doing, beyond what you tell us. I think you're trying to keep the question general, but we want the specific details somewhere. – Peter Cordes May 31 '16 at 15:49
  • Also, you probably could have edited your previous question into this, since there weren't any answers, and only a couple comments. (Normally you shouldn't change a question into something different, so I think what you did is actually reasonable) – Peter Cordes May 31 '16 at 15:51
  • Your edit doesn't even mention that you're trying to explain your test results **from an IvyBridge with HT**, or say what compiler options you used. We need that kind of detail, and the actual numbers (not just "slower" or "faster") to figure out whether it's something specific to the details of this, or whether there's something more general going on (like a memory bandwidth or cache-size issue). – Peter Cordes May 31 '16 at 16:21
  • @PeterCordes I will take the changing of questions as advice for the future however had in the past added information onto a question and been advised I should have asked a further question rather than editing it. I am not trying to keep the question general on purpose. I was unaware of what details were important to ask my question. I have added information about the time it takes to run each implementation with 8 threads as the argument on my machine which is an IvyBridge to the question. Please let me know anything else I can add to better ask my question. – joshuatvernon May 31 '16 at 16:52

1 Answer


I think it's because your SIMD code is horribly inefficient: it loops over the memory twice instead of doing the add together with the multiply before storing. You didn't test SIMD vs. a scalar baseline, but if you had, you'd probably find that your SIMD code wasn't a speedup with a single thread either.

STOP READING HERE if you want to solve the rest of your homework yourself.

If you used gcc -O3 -march=ivybridge, the simple scalar loop in the pthread version probably auto-vectorized into something like what you should have done with intrinsics. You even used restrict, so it might realize that the pointers can't overlap with each other, or with g_dampener.

// this probably autovectorizes well.
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
    t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}


// but this would be even safer to help the compiler's aliasing analysis:
double dampener = g_dampener;   // in case the compiler thinks one of the pointers might point at the global
double *restrict hat = t_arg->m_hat;
const double *restrict mat = t_arg->m;
... same loop, but using these locals instead of the t_arg-> members

It's probably not a problem for an FP loop, since double definitely can't alias with double *.


The coding style is also pretty nasty. You should give meaningful names to your __m256d variables whenever possible.

Also, you use malloc, which doesn't guarantee that matrix_hat will be aligned to a 32B boundary. C11's aligned_alloc is probably the nicest way, vs. posix_memalign (clunky interface), _mm_malloc (have to free with _mm_free, not free(3)), or other options.

double* construct_matrix_hat(const double* matrix) {
    // double* matrix_hat = malloc(sizeof(double) * g_n2);
    double* matrix_hat = aligned_alloc(64, sizeof(double) * g_n2);

    // double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener};  // This idiom is terrible, and might actually compile to code that stores it 4 times on the stack and then loads.
    __m256d vdamp = _mm256_set1_pd(g_dampener);    // will compile to a broadcast-load (vbroadcastsd)
    __m256d vhat  = _mm256_set1_pd(HAT);

    size_t last_full_vector = g_n2 & ~3ULL;    // don't load this from a global.
    // it's better for the compiler to see how it's calculated from g_n2

       // ??? Use simd to subtract values from each other          // huh?  this is a multiply, not a subtract.  Also, everyone can see it's using SIMD, that part adds no new information
    // if you really want to manually vectorize this, instead of using an OpenMP pragma or -O3 on the scalar loop, then:
    for (size_t i = 0; i < last_full_vector; i += 4) {
        __m256d vmat = _mm256_loadu_pd(matrix + i);
        __m256d vmul = _mm256_mul_pd(vmat, vdamp);
        __m256d vres = _mm256_add_pd(vmul, vhat);
        _mm256_store_pd(&matrix_hat[i], vres);    // aligned store.  Doesn't matter for performance.
    }
    #if 0
    // Scalar cleanup
    for (size_t i = last_full_vector; i < g_n2; i++) {
        matrix_hat[i] = g_dampener * matrix[i] + HAT;
    }
    #else
    // assume that g_n2 >= 4, and do a potentially-overlapping unaligned vector
    if (last_full_vector != g_n2) {
        // Or have this always run, and have the main loop stop one element sooner (so this overlaps by 0..3 instead of by 1..3 with a conditional)
        assert(g_n2 >= 4);
        __m256d vmat = _mm256_loadu_pd(matrix + g_n2 - 4);
        __m256d vmul = _mm256_mul_pd(vmat, vdamp);
        __m256d vres = _mm256_add_pd(vmul, vhat);
        _mm256_storeu_pd(&matrix_hat[g_n2-4], vres);
    }
    #endif

    return matrix_hat;
}

This version compiles (after defining a couple globals) to the asm we expect. BTW, normal people pass sizes around as function arguments. This is another way of avoiding optimization-failure due to C aliasing rules.
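For illustration, a hypothetical variant of that interface (the parameter names here are invented, not from the original code):

// Hypothetical sketch: pass the size and constants as arguments instead of using
// globals, so the compiler doesn't have to reason about pointers aliasing the globals.
#include <stdlib.h>   // aligned_alloc

double* construct_matrix_hat_v2(const double* restrict matrix, size_t n2,
                                double dampener, double hat) {
    double* matrix_hat = aligned_alloc(64, sizeof(double) * n2);
    for (size_t i = 0; i < n2; i++) {
        matrix_hat[i] = dampener * matrix[i] + hat;
    }
    return matrix_hat;
}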

Anyway, really your best bet is to let OpenMP auto-vectorize it, because then you don't have to write a cleanup loop yourself. There's nothing tricky about the data organization, so it vectorizes trivially. (And it's not a reduction, like in your other question, so there's no loop-carried dependency or order-of-operations concern).
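A minimal sketch of that approach, assuming the same globals as in the question (g_n2, g_nthreads, g_dampener, HAT) and OpenMP enabled (e.g. gcc -O3 -march=ivybridge -fopenmp):

#include <stdlib.h>   // aligned_alloc
#include <omp.h>      // omp_set_num_threads

double* construct_matrix_hat(const double* matrix) {
    double* matrix_hat = aligned_alloc(64, sizeof(double) * g_n2);
    const double dampener = g_dampener;   // local copy helps the compiler's aliasing analysis

    omp_set_num_threads(g_nthreads);      // honour the thread-count argument from the question
    // OpenMP splits the iterations across threads; -O3 vectorizes the loop body,
    // including any tail handling, so no hand-written cleanup loop is needed.
    #pragma omp parallel for
    for (size_t i = 0; i < g_n2; i++) {
        matrix_hat[i] = dampener * matrix[i] + HAT;
    }
    return matrix_hat;
}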

Peter Cordes
  • Thank you and I will fix up my coding style, especially accidentally copied comments, that is clearly very bad. May I ask what the `& -3ULL` does in the line `size_t last_full_vector = g_n2 & -3ULL`. – joshuatvernon Jun 02 '16 at 03:28
  • I assume -3ULL means -3 unsigned long long maybe – joshuatvernon Jun 02 '16 at 04:59
  • @joshuatvernon: Oops, that's actually a bug. It was meant to round down to an aligned value. (multiple of 4). i.e. mask away the low 2 bits. But it should be `~3ULL` (ones-complement bitwise not), or (`-4LL`) rather than `-3` (two's complement negative). Anyway, it's just a "clever" way of writing `0xFFFFFFFC` with as many Fs as the type width should have. There are many idioms for it. And yes, ULL is the suffix for an `unsigned long long` constant, just like `1.0f` is a `float` constant instead of the default `double`. – Peter Cordes Jun 02 '16 at 06:00
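For illustration, a standalone snippet (not from the answer) showing the equivalent round-down-to-a-multiple-of-4 idioms discussed above:

#include <stddef.h>
#include <assert.h>

void round_down_demo(void) {
    size_t n2 = 10;
    assert((n2 & ~3ULL)    == 8);   // bitwise NOT of 3: clears the low 2 bits
    assert((n2 & -4LL)     == 8);   // two's complement -4 = ...11111100
    assert((n2 - (n2 % 4)) == 8);   // arithmetic form
}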