I am attempting to write a simple parallel mergesort algorithm to help me understand the concept of hypercube communication. I have a C program that generates an array of random integers and assigns threads to sort the data. Each thread serially sorts its sublist of the data and then performs compare-exchange operations with the other threads to merge the sublists. Below is the thread function.
void *th_fun(void* args)
{
int i, j;
int my_id = *((int*)args);
int chunk_size = N_DATA/N_THREADS;
int start_index = my_id*chunk_size;
//set up the memory this thread will use
th_mem[my_id].my_data = &data[start_index];
th_mem[my_id].partner_data = malloc(sizeof(int)*chunk_size);
for(i = 0; i < chunk_size; i++)
th_mem[my_id].partner_data [i] = 0;
//sort my data serially, should use a mergesort
qsort((void*)th_mem[my_id].my_data, chunk_size, sizeof(int), compare);
//wait for all threads to finish sorting
barrier_simple();
//compare-exchange accross hypercube
int partner_id, op;
int logth = log(N_THREADS)/log(2);
for(i = 0; i < logth; i++)
{
partner_id = my_id ^ (int)pow(2,logth-i-1);
//send my data to partner
for(j = 0; j < chunk_size; j++)
{
th_mem[partner_id].partner_data[j] = th_mem[my_id].my_data[j];
}
//wait for all data to be sent
barrier_simple();
//get the lowest or highest elements between my_data, partner_data
int x = my_id & (int)pow(2,logth-i-1);
op = (x > 0 ? 1:-1);
int* output = merge(th_mem[my_id].my_data, th_mem[my_id].partner_data,
chunk_size, op);
//save results
for(j = 0; j < chunk_size; j++)
{
th_mem[my_id].my_data[j] = output[j];
}
free(output);
//wait for all data to be merged
barrier_simple();
}
free(th_mem[my_id].partner_data);
pthread_exit(args);
}
Sample run with 12 data elements sorted by 4 threads(output is chunked to indicate which thread is responsible for that data):
./parallel_merge 12 4
Unsorted:
0:924 674 704
1:495 554 645
2:979 285 119
3:878 80 620
Serial sort:
0:80 119 285
1:495 554 620
2:645 674 704
3:878 924 979
Parallel sort:
0:80 119 285
1:495 554 674
2:620 645 704
3:878 924 979
As you can see the parallel sort is not entirely correct(aside from performance considerations, ofc). I have done many gdb runs and cannot determine why the hypercube exchanges are not working out. Any help would be appreciated. My references were The Hypercube Template and Mergesort