I'm having trouble parallelizing a C++ program with OpenMP. I am implementing a recommendation system with a user-based collaborative filtering method. To do that, I implemented a sparse_matrix class as a dictionary of dictionaries (in the sense of a Python dictionary). Since insertion only happens at the beginning of the algorithm, when the data is read from file, I implemented each dictionary as a std::vector of (key, value) pair objects plus a flag that indicates whether the vector is sorted. If the vector is sorted, a key is looked up with a binary search; otherwise the vector is first sorted and then searched. Alternatively, the entries of a dictionary can be scanned linearly, for example in loops over all of its keys. The relevant portion of the code that is causing problems is the following:
    void compute_predicted_ratings_omp (sparse_matrix &targets,
                                        sparse_matrix &user_item_rating_matrix,
                                        sparse_matrix &similarity_matrix,
                                        int k_neighbors)
    {
        // Auxiliary private variables
        int user, item;
        double predicted_rating;
        dictionary<int,double> target_vector, item_rating_vector, item_similarity_vector;

        #pragma omp parallel shared(targets, user_item_rating_matrix, similarity_matrix)\
                             private(user, item, predicted_rating, target_vector, item_rating_vector, item_similarity_vector)
        {
            if (omp_get_thread_num() == 0)
                std::cout << " - parallelized on " << omp_get_num_threads() << " threads: " << std::endl;

            #pragma omp for schedule(dynamic, 1)
            for (size_t iter_row = 0; iter_row < targets.nb_of_rows(); ++iter_row)
            {
                // Retrieve target user
                user = targets.row(iter_row).get_key();
                // Retrieve the user rating vector.
                item_rating_vector = user_item_rating_matrix[user];
                for (size_t iter_col = 0; iter_col < targets.row(iter_row).value().size(); ++iter_col)
                {
                    // Retrieve target item
                    item = targets.row(iter_row).value().entry(iter_col).get_key();
                    // Retrieve the similarity vector associated with the target item
                    item_similarity_vector = similarity_matrix[item];
                    // Compute predicted rating
                    predicted_rating = predict_rating(item_rating_vector,
                                                      item_similarity_vector,
                                                      k_neighbors);
                    // Set result in targets
                    targets.row(iter_row).value().entry(iter_col).set_value(predicted_rating);
                }
            }
        }
    }
In this function I compute the predicted rating for a series of target (user, item) pairs; the prediction is simply a weighted average. The outer loop runs over the target users (the rows of the targets sparse matrix), and for each one I retrieve the user's rating vector by a binary search on the rows of user_item_rating_matrix. Then, for each column in the current row (i.e. for each item), I retrieve the vector associated with the current item from the sparse matrix similarity_matrix. With these two vectors, I compute the prediction as a weighted average of their elements (over a subset of the items common to both vectors).
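To make the data structure concrete, here is a minimal sketch of a sorted-vector dictionary of the kind described above. It is illustrative only and not the actual class: the name sketch_dictionary and the methods insert, sort and at are assumptions, and sorting is an explicit step here rather than something triggered lazily by a lookup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the dictionary described in the question.
template <typename K, typename V>
class sketch_dictionary {
    using entry = std::pair<K, V>;

public:
    // Insertion appends at the end and marks the vector as unsorted.
    void insert(const K &key, const V &value) {
        entries_.emplace_back(key, value);
        sorted_ = false;
    }

    // Sort once after loading, so that later lookups never modify the container.
    void sort() {
        std::sort(entries_.begin(), entries_.end(),
                  [](const entry &a, const entry &b) { return a.first < b.first; });
        sorted_ = true;
    }

    // Binary search by key; throws std::out_of_range when the key is absent.
    const V &at(const K &key) const {
        assert(sorted_ && "call sort() before looking up keys");
        auto it = std::lower_bound(entries_.begin(), entries_.end(), key,
                                   [](const entry &e, const K &k) { return e.first < k; });
        if (it == entries_.end() || it->first != key)
            throw std::out_of_range("key not found");
        return it->second;
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::vector<entry> entries_;
    bool sorted_ = true;  // an empty vector is trivially sorted
};
```

Because at is const and never sorts, any number of threads can call it on the same object once sort() has been called.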
My problem is the following: I want to parallelize the outer loop using OpenMP. The serial version of this function takes around 3 seconds. With OpenMP on 2 threads it takes around 2 seconds (which is not bad, since there is still some work imbalance in the outer loop). With 4 threads it takes 7 seconds. I cannot understand what causes this slowdown. Do you have any idea?
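For anyone who wants to reproduce this kind of measurement, a minimal timing harness could look like the sketch below. It is not the code that produced the numbers above: fake_row_work is a made-up workload and time_loop is a hypothetical helper. It only shows how the same dynamically scheduled loop can be timed with different thread counts. It assumes compilation with -fopenmp on GCC; without that flag the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical workload standing in for the predictions of one target row.
inline double fake_row_work(std::size_t row) {
    double acc = 0.0;
    for (std::size_t i = 0; i < 1000; ++i)
        acc += static_cast<double>((row + i) % 7);
    return acc;
}

// Runs the loop with the requested number of threads and returns the
// elapsed wall-clock time in seconds. Results are written to out.
inline double time_loop(std::size_t rows, int threads, std::vector<double> &out) {
    assert(threads > 0);
    out.assign(rows, 0.0);
    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for schedule(dynamic, 1) num_threads(threads)
    for (long long r = 0; r < static_cast<long long>(rows); ++r) {
        const std::size_t idx = static_cast<std::size_t>(r);
        out[idx] = fake_row_work(idx);  // each iteration writes its own slot
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_loop with 1, 2 and 4 threads on the same input gives directly comparable wall-clock times, and comparing the output vectors confirms that the parallel runs compute the same results.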
I have already thought about the problem; here are my considerations:
- I access the sparse matrices only in read mode. Since the matrices are pre-sorted, the binary searches should not modify them, so no race conditions should arise.
- Several threads may access the same vector of a sparse matrix at the same time. I have read about false sharing, but since I do not write to these vectors I do not think it is the cause of the slowdown.
- The parallel version seems to work fine with two threads (even if the speedup is lower than expected).
- No problem is observed with 4 threads for other choices of the parameters. In particular (cf. "Further details on predict_rating function" below), when I consider all the similar items for the weighted average and I scan the rating vector and search in the similarity vector (the opposite of what I normally do), the execution time scales well on 4 threads.
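Regarding the first point in the list above, one way to have the compiler enforce the read-only assumption is to route every lookup through a const reference, so that a lookup cannot sort or otherwise modify the vector. The sketch below is a hypothetical illustration: the names entry_vector, key_less, value_or_zero and sum_lookups are invented, and the (key, value) layout is assumed from the description of the dictionary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using entry = std::pair<int, double>;     // hypothetical (key, value) layout
using entry_vector = std::vector<entry>;

inline bool key_less(const entry &a, const entry &b) { return a.first < b.first; }

// Lookup through a const reference: any attempt to sort or modify the
// vector inside this function would fail to compile.
inline double value_or_zero(const entry_vector &sorted_entries, int key) {
    auto it = std::lower_bound(sorted_entries.begin(), sorted_entries.end(), key,
                               [](const entry &e, int k) { return e.first < k; });
    return (it != sorted_entries.end() && it->first == key) ? it->second : 0.0;
}

// Many threads read the same vector; the only shared write is the
// reduction variable, which OpenMP keeps private per thread until the end.
inline double sum_lookups(const entry_vector &sorted_entries, const std::vector<int> &keys) {
    // Check the precondition once, outside the parallel loop.
    assert(std::is_sorted(sorted_entries.begin(), sorted_entries.end(), key_less));
    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < static_cast<long long>(keys.size()); ++i)
        total += value_or_zero(sorted_entries, keys[static_cast<std::size_t>(i)]);
    return total;
}
```

If the real lookup operator can be declared const in the same way, the read-only claim is checked at compile time instead of being assumed.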
Further details on predict_rating function: This function works as follows. The shorter of item_rating_vector and item_similarity_vector is scanned linearly, and each of its keys is looked up in the longer one with a binary search. If the rating/similarity is positive, it is included in the weighted average.
    double predict_rating (dictionary<int, double> &item_rating_vector,
                           dictionary<int, double> &item_similarity_vector)
    {
        size_t size_item_rating_vector = item_rating_vector.size();
        size_t size_item_similarity_vector = item_similarity_vector.size();

        if (size_item_rating_vector == 0 || size_item_similarity_vector == 0)
            return 0.0;
        else
        {
            double s, r, sum_s = 0.0, sum_sr = 0.0;
            int temp_item = 0;

            if (size_item_rating_vector < size_item_similarity_vector)
            {
                // Scan item_rating_vector and search in item_similarity_vector
                for (dictionary<int,double>::const_iterator iter = item_rating_vector.begin();
                     iter != item_rating_vector.end();
                     ++iter)
                {
                    // Scan the rating vector forwards until the whole vector
                    // has been scanned.
                    temp_item = (*iter).get_key();
                    // Retrieve the similarity between temp_item and the target item (0.0 if absent)
                    try { s = item_similarity_vector[temp_item]; }
                    catch (const std::out_of_range &e) { s = 0.0; }
                    if (s > 0.0)
                    {
                        // temp_item is positively similar to the target item: consider it in the average.
                        // Retrieve the rating that the user gave to temp_item
                        r = (*iter).get_value();
                        // Increment the sums
                        sum_s += s;
                        sum_sr += s * r;
                    }
                }
            }
            else
            {
                // Scan item_similarity_vector and search in item_rating_vector
                for (dictionary<int,double>::const_iterator iter = item_similarity_vector.begin();
                     iter != item_similarity_vector.end();
                     ++iter)
                {
                    // Scan the similarity vector forwards until the whole vector
                    // has been scanned.
                    temp_item = (*iter).get_key();
                    s = (*iter).get_value();
                    // Skip items that are not positively similar to the target item
                    if (!(s > 0.0))
                        continue;
                    // Retrieve the rating that the user gave to temp_item (0.0 if not given)
                    try { r = item_rating_vector[temp_item]; }
                    catch (const std::out_of_range &e) { r = 0.0; }
                    if (r > 0.0)
                    {
                        // temp_item is positively similar and was rated by the user: increment the sums
                        sum_s += s;
                        sum_sr += s * r;
                    }
                }
            }

            if (sum_s > 0.0)
                return sum_sr / sum_s;
            else
                return 0.0;
        }
    }
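For readers who want to try the computation without the dictionary class, the same "scan the shorter vector, binary-search the longer one" idea can be sketched on plain key-sorted std::vector objects. This is a simplified stand-in, not the function above: the names entries, by_key, lookup_or_zero and predict_rating_sketch are hypothetical, absent keys are handled with a return value of 0.0 instead of an exception, and an item is counted only when both its similarity and its rating are positive.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical layout: (item, value) pairs sorted by item.
using entries = std::vector<std::pair<int, double>>;

inline bool by_key(const std::pair<int, double> &a, const std::pair<int, double> &b) {
    return a.first < b.first;
}

// Binary search in a key-sorted vector; returns 0.0 when the key is absent.
inline double lookup_or_zero(const entries &v, int key) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
                               [](const std::pair<int, double> &e, int k) { return e.first < k; });
    return (it != v.end() && it->first == key) ? it->second : 0.0;
}

// Similarity-weighted average of the ratings over the items common to both vectors.
inline double predict_rating_sketch(const entries &ratings, const entries &similarities) {
    assert(std::is_sorted(ratings.begin(), ratings.end(), by_key));
    assert(std::is_sorted(similarities.begin(), similarities.end(), by_key));

    // Scan the shorter vector linearly and search in the longer one.
    const bool scan_ratings = ratings.size() < similarities.size();
    const entries &shorter = scan_ratings ? ratings : similarities;
    const entries &longer = scan_ratings ? similarities : ratings;

    double sum_s = 0.0, sum_sr = 0.0;
    for (const auto &e : shorter) {
        const double other = lookup_or_zero(longer, e.first);
        const double r = scan_ratings ? e.second : other;
        const double s = scan_ratings ? other : e.second;
        if (s > 0.0 && r > 0.0) {
            sum_s += s;
            sum_sr += s * r;
        }
    }
    return sum_s > 0.0 ? sum_sr / sum_s : 0.0;
}
```

As a small worked case, ratings {(1, 4.0), (2, 2.0)} and similarities {(1, 0.5), (2, 0.5), (3, 0.9)} share items 1 and 2, so sum_s = 1.0, sum_sr = 0.5 * 4.0 + 0.5 * 2.0 = 3.0, and the prediction is 3.0.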
Further details on the hardware: I am running this program on a Dell XPS 15 with a quad-core i7 processor and 16 GB of RAM. I execute the code in a Linux VirtualBox VM (configured to use 4 processors and 4 GB of RAM).
Thanks in advance,
Pierpaolo