I have a loop nest over a two-dimensional matrix whose performance I need to improve with OpenMP:
#pragma omp parallel private(i, j, k) num_threads(4)
{
    for(i=0; i<N; i++) {
        #pragma omp for schedule(dynamic, 2) // cannot be moved above the outer (i) loop: step i+1 depends on the results of step i
        for(j=i+1; j<N; j++) {
            for(k=i+1; k<N; k++) {
                A[j][k] -= A[i][k]*A[j][i];
            }
        }
    }
}
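For reference, here is a self-contained version of the simplified loop nest, assuming `A` is an array of `n` row pointers to `double` (the post does not show how `A` and `N` are declared, so `double **A` and the function name `trailing_update` are hypothetical). It keeps the structure of the snippet: every thread runs the `i` loop, only the `j` loop is work-shared, and the implicit barrier at the end of `omp for` separates successive `i` steps.

```c
#include <assert.h>

/* Hypothetical wrapper around the simplified loop nest from the question.
   Assumes A points to n rows of n doubles each. */
void trailing_update(double **A, int n)
{
    assert(n >= 0);
    #pragma omp parallel num_threads(4)
    {
        /* Loop variables declared inside the region are private per thread. */
        for (int i = 0; i < n; i++) {
            /* For a fixed i the rows j > i are independent, so only the j loop
               is shared. The implicit barrier at the end of omp for makes sure
               row i+1 is final before any thread starts step i+1. */
            #pragma omp for schedule(dynamic, 2)
            for (int j = i + 1; j < n; j++) {
                for (int k = i + 1; k < n; k++) {
                    A[j][k] -= A[i][k] * A[j][i];
                }
            }
        }
    }
}
```

Compiled without `-fopenmp`, the pragmas are ignored and the function runs serially, which gives a simple baseline to compare the parallel result against.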
I wonder whether I could improve cache locality or use tasks to get a speedup.
Thanks!
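On the tasks question, one way to express the same update with OpenMP tasks is to let a single thread create one task per row `j` in each `i` step and wait for all of them with `taskwait` before the next step. This is only a sketch for comparison (the function name `trailing_update_tasks` and the `double **A` layout are hypothetical); each task carries creation overhead, so it is not guaranteed to beat the `omp for` version and needs to be measured.

```c
#include <assert.h>

/* Hypothetical task-based variant of the simplified update.
   Assumes A points to n rows of n doubles each. */
void trailing_update_tasks(double **A, int n)
{
    assert(n >= 0);
    #pragma omp parallel num_threads(4)
    #pragma omp single
    {
        for (int i = 0; i < n; i++) {
            /* One task per row below the pivot; rows are independent for a fixed i. */
            for (int j = i + 1; j < n; j++) {
                #pragma omp task firstprivate(i, j) shared(A, n)
                {
                    for (int k = i + 1; k < n; k++) {
                        A[j][k] -= A[i][k] * A[j][i];
                    }
                }
            }
            /* Wait for every row of step i before step i+1 reads row i+1. */
            #pragma omp taskwait
        }
    }
}
```

The `taskwait` plays the same role as the implicit barrier of `omp for`; the other threads of the team pick up tasks while the `single` thread generates them.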
Note: the code above is part of an LU factorization algorithm, simplified for the question. This is the actual code I use:
#pragma omp parallel private(i, j, k) num_threads(4)
{
    for(i=0; i<N; i++) {
        #pragma omp for schedule(dynamic, 2) nowait
        for(j=i+1; j<N; j++) {
            A[j][i] = A[j][i]/A[i][i];
        }
        #pragma omp barrier
        #pragma omp for schedule(dynamic, 2) nowait
        for(j=i+1; j<N; j++) {
            for(k=i+1; k<N; k++) {
                A[j][k] -= A[i][k]*A[j][i];
            }
        }
        #pragma omp barrier
    }
}
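A self-contained sketch of the full factorization with two changes that are often tried for this loop nest. First, the division and the update of row `j` are fused into one iteration: row `j` needs only its own multiplier and the pivot row, so one work-shared loop (and one barrier) per step is enough instead of two. Second, the pivot row, the pivot and the multiplier are loaded into locals, so the `k` loop visibly streams through two contiguous rows. The `double **A` layout, the function name `lu_inplace`, and `schedule(static)` are assumptions; within one step every row does the same `n - i - 1` updates, which is why a static schedule is used here, but any speedup has to be measured.

```c
#include <assert.h>

/* Hypothetical in-place LU factorization without pivoting, as in the question:
   on return the strict lower triangle of A holds L (unit diagonal implied)
   and the upper triangle holds U.
   Assumes A points to n rows of n doubles and every pivot is nonzero. */
void lu_inplace(double **A, int n)
{
    assert(n >= 0);
    #pragma omp parallel num_threads(4)
    {
        for (int i = 0; i < n; i++) {
            const double *ri = A[i];     /* pivot row, not written in step i */
            const double pivot = ri[i];
            assert(pivot != 0.0);

            /* Division and update of row j are fused into one iteration.
               The implicit barrier at the end of omp for makes row i+1 final
               before any thread starts step i+1. */
            #pragma omp for schedule(static)
            for (int j = i + 1; j < n; j++) {
                double *rj = A[j];
                const double m = rj[i] / pivot;  /* multiplier L[j][i] */
                rj[i] = m;
                for (int k = i + 1; k < n; k++) {
                    rj[k] -= ri[k] * m;
                }
            }
        }
    }
}
```

Usage mirrors the original: `lu_inplace(A, N);` built with e.g. `gcc -fopenmp`. Building without `-fopenmp` runs the same code serially, which is handy for checking that the parallel result matches.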