I'd like to generate a random matrix with OpenMP like it were generated by a sequential program, i.e. if any sequential matrix generator outputs me a matrix like the following one:
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 0.0 1.0 2.0
3.0 4.0 5.0 6.0
I want the parallel OpenMP version of the same program to generate the same matrix with no interleaved rows.
Here is how I gradually approached the problem.
Given my serial generator C function generating a matrix as a 1D array:
void generate_matrix_array(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
First, I naively tried the #pragma omp parallel for
directive to outer for
loop; however, there's no guarantee about row ordering, since thread execution gets interleaved, so they get generated in a non-deterministic order.
Adding the ordered
option would solve the issue at the price of making useless multithreading in this particular case.
In order to solve the issue, I tried to partition by hand the matrix array so that thread i would generate the i-th slice of it:
void generate_matrix_array_par(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
#pragma omp parallel \
shared(v)
{
int tid = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int rows_per_thread = round(rows / (double) nthreads);
int rem_rows = rows % (nthreads - 1) != 0?
rows % (nthreads - 1):
rows_per_thread;
int local_rows = (tid == 0)?
rows_per_thread:
rem_rows;
int lower_row = tid * local_rows;
int upper_row = ((tid + 1) * local_rows);
printf(
"[T%d] receiving %d of %d rows from row %d to %d\n",
tid,
local_rows,
rows,
lower_row,
upper_row - 1
);
printf("\n");
fflush(stdout);
for (int i = lower_row; i < upper_row; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
}
However, despite matrix vector gets properly divided among threads, for some reason unknown to me, every thread generates its rows into the matrix in a non-deterministic order, i.e. if I want to generate a 8x8 matrix with 4 threads and thread 3 is assigned to rows 4 and 5, he will generate two contiguous rows in the matrix array but in the wrong position every time, like if I didn't perform any partitioning and the omp parallel for
directive was in place.
I skeptically tried, at last, to get back to naive approach by specifying shared(v)
and schedule(static, 16)
options to omp parallel for
directive and it 'magically' happens to work:
void generate_matrix_array_par(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
int nthreads = omp_get_max_threads();
int chunk_size = (rows * columns) / nthreads;
#pragma omp parallel for \
shared(v) \
schedule(static, chunk_size)
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
The Edit: Looks like schedule
option is being added since I read somewhere else that it gets rid of cache conflicts.schedule
splits up data to thread in a round-robin fashion according to a given chunk size; so if I share N/nthreads-sized chunks among threads, data will be assigned in a single round.
Any question? YES!!!
Now, I'd like to know whether I missed or failed some consideration about the problem, since I'm not convinced about the fairness of my last version of the program, despite the fact that it is working.