I have little experience with parallel programming and was wondering if anyone could take a quick look at a bit of code I've written and see whether there are any obvious ways to improve the efficiency of the computation.
The difficulty arises from the fact that I need to compute several matrix operations of unequal dimensionality, so I'm not sure of the most condensed way to code the computation.
Below is my code. Note this code DOES work. The matrices I am working with are of dimension approx. 700x700 [see int s below] or 700x30 [int n].
Also, I am using the Armadillo library for my sequential code. It may be that parallelizing with OpenMP while retaining the Armadillo matrix classes is slower than defaulting to the standard library; does anyone have an opinion on this (before I spend hours overhauling!)?
double start, end, dif;
int i,j,k; // iteration counters
int s,n; // matrix dimensions
mat B; B.load(...location of stored s*n matrix...); // input objects loaded from file
mat I; I.load(...s*s matrix...);
mat R; R.load(...s*n matrix...);
mat D; D.load(...n*n matrix...);
double e = 0.1; // scalar parameter
s = B.n_rows; n = B.n_cols;
mat dBdt; dBdt.zeros(s,n); // object for storing output of function
// 100X sequential computation using Armadillo linear algebraic functionality
start = omp_get_wtime();
for (int r = 0; r < 100; r++) {
    dBdt = B % (R - (I * B)) + (B * D) - (B * e);
}
end = omp_get_wtime();
dif = end - start;
cout << "Seq computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;
// 100 * parallel computation using OpenMP
omp_set_num_threads(8);
start = omp_get_wtime();
for (int r = 0; r < 100; r++) {
    dBdt.zeros(); // reset: the += accumulation below must start from zero
    // parallel computation of I * B
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < s; k++) {
                dBdt(i, j) += I(i, k) * B(k, j);
            }
        }
    }
    // parallel computation of B % (R - (I * B)) - B * e
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            dBdt(i, j) = R(i, j) - dBdt(i, j);
            dBdt(i, j) *= B(i, j);
            dBdt(i, j) -= B(i, j) * e;
        }
    }
    // parallel computation of B * D
    #pragma omp parallel for default(none) shared(dBdt, B, I, R, D, e, s, n) private(i, j, k) schedule(static)
    for (i = 0; i < s; i++) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < n; k++) {
                dBdt(i, j) += B(i, k) * D(k, j);
            }
        }
    }
}
end = omp_get_wtime();
dif = end - start;
cout << "OMP computation: " << dBdt(0,0) << endl;
printf("relaxation time = %f", dif);
cout << endl;
Running 8 threads on 4 hyper-threaded cores, I get the following output:
Seq computation: 5.54926e-10
relaxation time = 0.130031
OMP computation: 5.54926e-10
relaxation time = 2.611040
This suggests that although both methods produce the same result, the parallel formulation is roughly 20 times slower than the sequential one.
It is possible that for matrices of this size, the overhead involved in this 'variable-dimension' problem outweighs the benefits of parallelizing. Any insights would be much appreciated.
Thanks in advance,
Jack