
I'd like to generate a random matrix with OpenMP as if it were generated by a sequential program, i.e. if a sequential matrix generator outputs a matrix like the following:

1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 0.0 1.0 2.0
3.0 4.0 5.0 6.0

I want the parallel OpenMP version of the same program to generate the same matrix with no interleaved rows.


Here is how I gradually approached the problem.

Given my serial generator C function generating a matrix as a 1D array:

void generate_matrix_array(
    double *v,
    int rows,
    int columns,
    double min,
    double max,
    int seed
) {
    srand(seed);

    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < columns; j++) {
            v[i*columns + j] = min + (rand() / (RAND_MAX / (max - min)));
        }
    }
}

First, I naively applied the #pragma omp parallel for directive to the outer loop; however, there's no guarantee about row ordering, since thread execution gets interleaved, so the rows are generated in a non-deterministic order.

Adding the ordered clause would solve the issue, at the price of making multithreading useless in this particular case.
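For reference, a sketch of what that ordered variant might look like (my own illustration, not code from the original program): the ordered block runs in iteration order, so the shared rand() calls happen in the same sequence as in the serial code, but threads spend most of their time waiting for their turn.

```c
#include <stdlib.h>

/* The ordered region executes in iteration order, so rows (and the
 * shared rand() calls inside them) happen sequentially. This
 * reproduces the serial output but serializes almost all the work. */
void generate_matrix_array_ordered(double *v, int rows, int columns,
                                   double min, double max, int seed) {
    srand(seed);
    #pragma omp parallel for ordered
    for (int i = 0; i < rows; i++) {
        #pragma omp ordered
        {
            for (int j = 0; j < columns; j++)
                v[i*columns + j] = min + (rand() / (RAND_MAX / (max - min)));
        }
    }
}
```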

To solve the issue, I tried to partition the matrix array by hand so that thread i would generate the i-th slice of it:

void generate_matrix_array_par(
    double *v,
    int rows,
    int columns,
    double min,
    double max,
    int seed
) {
    srand(seed);

    #pragma omp parallel \
            shared(v)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int rows_per_thread = round(rows / (double) nthreads);
        int rem_rows = rows % (nthreads - 1) != 0?
            rows % (nthreads - 1):
            rows_per_thread;
        int local_rows = (tid == 0)?
            rows_per_thread:
            rem_rows;
        int lower_row = tid * local_rows;
        int upper_row = ((tid + 1) * local_rows);

        printf(
            "[T%d] receiving %d of %d rows from row %d to %d\n",
            tid,
            local_rows,
            rows,
            lower_row,
            upper_row - 1
        );
        printf("\n");
        fflush(stdout);

        for (int i = lower_row; i < upper_row; i++) {
            for (int j = 0; j < columns; j++) {
                v[i*columns + j] = min + (rand() / (RAND_MAX / (max - min)));
            }
        }
    }
}

However, even though the matrix array gets properly divided among threads, for some reason unknown to me every thread generates its rows in a non-deterministic position: e.g. if I generate an 8x8 matrix with 4 threads and thread 3 is assigned rows 4 and 5, it generates two contiguous rows in the matrix array, but in the wrong position every time, as if I hadn't performed any partitioning and the plain omp parallel for directive were in place.
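As an aside, a common alternative for computing per-thread row ranges is the proportional split below (my own sketch, independent of the code above). It spreads the remainder rows one at a time across threads instead of lumping them onto a single thread, and avoids the special-casing of thread 0:

```c
/* Balanced block partition: thread t of T handles rows
 * [t*rows/T, (t+1)*rows/T). Consecutive threads get ranges that
 * differ by at most one row, and the ranges tile [0, rows) exactly. */
void block_range(int tid, int nthreads, int rows, int *lo, int *hi) {
    *lo = (int) ((long long) tid * rows / nthreads);
    *hi = (int) ((long long) (tid + 1) * rows / nthreads);
}
```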

Skeptically, I went back at last to the naive approach, specifying the shared(v) and schedule(static, chunk_size) clauses on the omp parallel for directive, and it 'magically' happens to work:

void generate_matrix_array_par(
    double *v,
    int rows,
    int columns,
    double min,
    double max,
    int seed
) {
    srand(seed);

    int nthreads = omp_get_max_threads();
    int chunk_size = (rows * columns) / nthreads;

    #pragma omp parallel for \
            shared(v) \
            schedule(static, chunk_size)
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < columns; j++) {
            v[i*columns + j] = min + (rand() / (RAND_MAX / (max - min)));
        }
    }
}

The schedule clause was added because I read elsewhere that it gets rid of cache conflicts. Edit: it looks like schedule hands out iterations to threads in round-robin fashion according to the given chunk size; so if I share N/nthreads-sized chunks among threads, the data will be assigned in a single round.
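That round-robin mapping can be illustrated with a tiny helper (my own sketch, not an OpenMP API call): under schedule(static, chunk), chunks of `chunk` consecutive iterations are dealt to threads in order. Note, though, that the loop above has `rows` iterations, not `rows * columns`, so a chunk of `(rows * columns) / nthreads` exceeds the whole iteration space whenever columns >= nthreads; every row is then assigned to thread 0, which would also explain a sequential-looking result.

```c
/* Which thread owns iteration i under schedule(static, chunk) with
 * nthreads threads: chunks of `chunk` consecutive iterations are
 * dealt out round-robin. (Illustration only, not an OpenMP call.) */
int owner_of_row(int i, int chunk, int nthreads) {
    return (i / chunk) % nthreads;
}
```

With rows = 8, nthreads = 4 and chunk = 2, rows 0-1 go to thread 0, 2-3 to thread 1, 4-5 to thread 2 and 6-7 to thread 3, i.e. a single round.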


Any question? YES!!!

Now, I'd like to know whether I missed some consideration about the problem, since I'm not convinced of the soundness of my last version of the program, despite the fact that it works.

bissim
  • May I ask why you want to do this? Why do you deem it worth achieving to generate, in parallel, a random matrix identical to one generated sequentially? Just keep in mind that, generally speaking, random number generators are iterative processes and are often not thread-safe. So there are ways, but they'll likely be more computationally intensive than the sequential version, or they'll serialize the calls to the RNG, essentially making the parallelism void... And BTW, your solution doesn't work (or works by pure chance). – Gilles Feb 17 '20 at 06:10
  • To achieve replicability of the experiment, of course, especially in scenarios where matrices are 8192x8192 or greater (so it's worth the try). I considered that the RNG could be affected by concurrent behaviour; however, I should then have seen unpredictable behaviour, like a group of rows repeating a few times, yet this didn't happen. Moreover, the one I indicated as the current solution is the only one that actually generates the same matrix as the sequential code to date, across several executions and several reboots. – bissim Feb 17 '20 at 08:52
  • It is still not clear why you would ever want to create the matrix in parallel. Presumably you then do something to this matrix, and that is most likely going to be the compute-intensive part, so parallelising the construction of the matrix is probably completely unnecessary. – Qubit Feb 17 '20 at 10:12
  • I want to create a matrix in parallel for the sake of creating a matrix in parallel; I may or may not have other computation to perform after the generation, but that's not the question here. I may accept the objection that it's not worth the effort, even though I clearly stated that I'm interested in generating huge matrices of 8192x8192 elements (and I may set a minimum threshold with an `if` clause, in case). – bissim Feb 17 '20 at 12:14
  • As I said, I can show you how to do it, but essentially, that won't be truly in parallel... Actual calls to the RNG will be sequentialized, and the "parallel" version will likely be slower than the sequential version... Now, I also know a way with a poor quality RNG, but is that really worth exploring? – Gilles Feb 17 '20 at 15:50
  • In principle, if you wanted to use decent RNG and really do it in parallel: if you know at most how many cores you'll be running, you could create an array of random number generators, each assigned to a specific chunk of the matrix. That way you can follow the same assignments regardless how many threads you are using (simply set the chunk size correctly), and create the random values in parallel. There are however some concerns about how wise it is to combine values from several instances of the same generator, if you are extremely unlucky you might get repetitions, very unlikely though. – Qubit Feb 17 '20 at 15:54
  • In fact, the poor performance of my code (even on large instances) was the other reason I opened this question, even though I didn't state it clearly in my post. Anyway, I didn't need a highly uniform RNG like Mersenne-Twister; my wish was simply to create a matrix with concurrent threads. It may be interesting to create an instance of the RNG for each thread, maybe with every thread providing its own ID as seed. – bissim Feb 17 '20 at 18:12
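Following up on the last two comments, the per-chunk generator idea might be sketched as below. Assumptions to flag: rand_r is POSIX rather than ISO C, and seeding each row's state as seed + i is my own arbitrary choice. The output is identical for any thread count, but it is not the same matrix the serial srand()/rand() code produces.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* One generator state per row, seeded deterministically from the
 * global seed: any thread may fill any row, and the resulting matrix
 * is identical regardless of the number of threads. */
void generate_matrix_array_par2(double *v, int rows, int columns,
                                double min, double max, unsigned seed) {
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        unsigned state = seed + (unsigned) i;  /* per-row RNG state */
        for (int j = 0; j < columns; j++)
            v[i*columns + j] = min + (rand_r(&state) / (RAND_MAX / (max - min)));
    }
}
```

As noted in the comments, combining streams from several instances of the same generator carries some statistical risk of overlap; unlikely, but worth keeping in mind for very large matrices.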
