
I'm currently having a problem parallelizing a C++ program with OpenMP. I am implementing a recommendation system with a user-based collaborative filtering method. To do that, I implemented a sparse_matrix class as a dictionary of dictionaries (in the sense of a Python dictionary). In my case, since insertion is only done at the beginning of the algorithm, when data is read from file, I implemented the dictionary as a standard library vector of (key, value) pairs with a flag that indicates whether the vector is sorted. If the vector is sorted, a key is looked up with a binary search; otherwise the vector is sorted first and then searched. Alternatively, the dictionary's entries can be scanned linearly, for example in loops over all the keys of the dictionary. The relevant portion of the code that is causing problems is the following:

void compute_predicted_ratings_omp (sparse_matrix &targets,
                                    sparse_matrix &user_item_rating_matrix,
                                    sparse_matrix &similarity_matrix,
                                    int k_neighbors)
{
    // Auxiliary private variables
    int user, item;
    double predicted_rating;
    dictionary<int,double> target_vector, item_rating_vector, item_similarity_vector;

    #pragma omp parallel shared(targets, user_item_rating_matrix, similarity_matrix)\
                         private(user, item, predicted_rating, target_vector, item_rating_vector, item_similarity_vector)
    {
        if (omp_get_thread_num() == 0)
            std::cout << " - parallelized on " << omp_get_num_threads() << " threads: " << std::endl;

        #pragma omp for schedule(dynamic, 1)
        for (size_t iter_row = 0; iter_row < targets.nb_of_rows(); ++iter_row)
        {
            // Retrieve target user
            user = targets.row(iter_row).get_key();

            // Retrieve the user rating vector.
            item_rating_vector = user_item_rating_matrix[user];

            for (size_t iter_col = 0; iter_col < targets.row(iter_row).value().size(); ++iter_col)
            {
                // Retrieve target item
                item = targets.row(iter_row).value().entry(iter_col).get_key();

                // retrieve similarity vector associated to the target item
                item_similarity_vector = similarity_matrix[item];

                // Compute predicted rating
                predicted_rating = predict_rating(item_rating_vector,
                                                  item_similarity_vector,
                                                  k_neighbors);

                // Set result in targets
                targets.row(iter_row).value().entry(iter_col).set_value(predicted_rating);
            }
        }
    }
}
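
For reference, here is a minimal sketch of the dictionary structure I described above (simplified, not my actual implementation): a vector of (key, value) pairs with a sorted flag, lazy sorting, and binary-search lookup that throws for missing keys.

#include <algorithm>
#include <stdexcept>
#include <utility>
#include <vector>

// Minimal sketch of the dictionary described above: a vector of (key, value)
// pairs with a flag that indicates whether the vector is sorted. Lookups
// sort lazily, then binary-search, throwing std::out_of_range for missing keys.
template <typename K, typename V>
class dictionary
{
    std::vector<std::pair<K, V>> entries;
    bool sorted = true;

public:
    void insert (const K &key, const V &value)
    {
        entries.push_back(std::make_pair(key, value));
        sorted = false; // appending may break the ordering
    }

    V operator[] (const K &key)
    {
        if (!sorted)
        {
            // Keys are unique, so sorting pairs lexicographically orders by key
            std::sort(entries.begin(), entries.end());
            sorted = true;
        }
        auto it = std::lower_bound(entries.begin(), entries.end(), key,
                                   [](const std::pair<K, V> &entry, const K &k)
                                   { return entry.first < k; });
        if (it == entries.end() || it->first != key)
            throw std::out_of_range("key not found");
        return it->second;
    }

    size_t size () const { return entries.size(); }
};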

In this function I compute the predicted rating for a series of target (user, item) pairs (this is simply a weighted average). To do that, I do an outer loop over the target users (which are on the rows of the targets sparse matrix), and I retrieve the rating vector for the current user by performing a binary search on the rows of user_item_rating_matrix. Then, for each column in the current row (i.e. for each item), I retrieve another vector, associated with the current item, from the sparse matrix similarity_matrix. With these two vectors, I compute the prediction as a weighted average of their elements (over a subset of the items the two vectors have in common).

My problem is the following: I want to parallelize the outer loop with OpenMP. In the serial version, this function takes around 3 seconds. With OpenMP on 2 threads, it takes around 2 seconds (which is not bad, since I still have some work imbalance in the outer loop). With 4 threads, it takes 7 seconds. I cannot understand the cause of this slowdown. Do you have any idea?

I have already thought about the problem, and here are my considerations:

  • I access the sparse_matrix objects only in read mode. Since the matrices are pre-sorted, the binary searches do not modify them and no race conditions should arise.
  • Different threads could access the same vector of a sparse matrix at the same time. I read something about false sharing, but since I do not write to these vectors, I don't think this is the reason for the slowdown.
  • The parallel version seems to work fine with two threads (even if the speedup is lower than expected).
  • No problem is observed with 4 threads for other choices of the parameters. In particular (cf. "Further details on the predict_rating function" below), when I consider all the similar items for the weighted average and scan the rating vector while searching in the similarity vector (the opposite of what I normally do), the execution time scales well on 4 threads.

Further details on the predict_rating function: this function works in the following way. The smaller of item_rating_vector and item_similarity_vector is scanned linearly, and a binary search is done on the larger of the two. If the rating/similarity is positive, it is considered in the weighted average.

double predict_rating (dictionary<int,double> &item_rating_vector,
                       dictionary<int,double> &item_similarity_vector,
                       int k_neighbors) // unused in this variant, which averages over all positively similar items
{
    size_t size_item_rating_vector = item_rating_vector.size();
    size_t size_item_similarity_vector = item_similarity_vector.size();

    if (size_item_rating_vector == 0 || size_item_similarity_vector == 0)
        return 0.0;

    double s, r, sum_s = 0.0, sum_sr = 0.0;
    int temp_item = 0;

    if (size_item_rating_vector < size_item_similarity_vector)
    {
        // Scan item_rating_vector linearly and binary-search item_similarity_vector
        for (dictionary<int,double>::const_iterator iter = item_rating_vector.begin();
             iter != item_rating_vector.end();
             ++iter)
        {
            temp_item = (*iter).get_key();

            // Retrieve the similarity between temp_item and the target item (0.0 if absent)
            try { s = item_similarity_vector[temp_item]; }
            catch (const std::out_of_range &) { s = 0.0; }

            if (s > 0.0)
            {
                // temp_item is positively similar to the target item:
                // retrieve the rating the user gave to it and increment the sums
                r = (*iter).get_value();
                sum_s += s;
                sum_sr += s * r;
            }
        }
    }
    else
    {
        // Scan item_similarity_vector linearly and binary-search item_rating_vector
        for (dictionary<int,double>::const_iterator iter = item_similarity_vector.begin();
             iter != item_similarity_vector.end();
             ++iter)
        {
            temp_item = (*iter).get_key();
            s = (*iter).get_value();

            // Skip items that are not positively similar to the target item
            if (!(s > 0.0))
                continue;

            // Retrieve the rating that the user gave to temp_item (0.0 if not given)
            try { r = item_rating_vector[temp_item]; }
            catch (const std::out_of_range &) { r = 0.0; }

            if (r > 0.0)
            {
                // The user rated temp_item positively: increment the sums
                sum_s += s;
                sum_sr += s * r;
            }
        }
    }

    if (sum_s > 0.0)
        return sum_sr / sum_s;
    else
        return 0.0;
}

Further details on the hardware: I am running this program on a Dell XPS 15 with a quad-core i7 processor and 16 GB of RAM. I execute the code in a Linux VirtualBox VM (with the VM set to use 4 processors and 4 GB of RAM).

Thanks in advance,

Pierpaolo

  • Could be a caching problem. Normally it is better to parallelize the outer loop, but maybe your case could benefit from better data locality if you parallelize the inner loop instead. – Manuel M Jan 26 '15 at 17:45
  • Also, is your VM using all four physical CPUs? Or maybe it is using four *logical* CPUs out of the eight logical CPUs in your system? (4 cores + hyperthreading = 8 logical cores) – Manuel M Jan 26 '15 at 17:47
  • Hi Manuel, thanks for the suggestions. I will try to parallelize the inner loop and I will keep you updated. I am pretty sure that the VM is using the 4 physical CPUs (I set the number of CPUs to 4 in the VirtualBox manager) and this same code with other parameters is showing the expected speed-up. However, how can I check the number of physical cores that the VM is using from the bash (my VM runs Ubuntu)? – Pierpaolo Necchi Jan 26 '15 at 20:02
  • I tried to parallelize the inner for instead of the outer and I observe the same pattern, with a further slow-down due to overheads for initializing the threads in the outer loop. Execution time explodes when passing from 2 to 4 threads. – Pierpaolo Necchi Jan 26 '15 at 21:25
  • @PierpaoloNecchi I would suggest running your code on a bare-metal computational resource. I've had a lot of trouble with VirtualBox when running any parallel software. Try it on a real server and check if that still holds. – dmg Jan 28 '15 at 12:32

2 Answers


It appears you might have a false sharing problem with your targets variable. False sharing occurs when different threads frequently write to locations near each other (in the same cache line). By explicitly setting the schedule to dynamic with a chunk size of 1, you are telling OpenMP to have each thread take work one element at a time, thus allowing different threads to write to data that may be near each other in targets.

I would recommend removing the schedule directive just to see how the default scheduler and chunk size do. Then I would try both static and dynamic schedules while varying the chunk size substantially. If your workload or hardware platform is unbalanced, dynamic will probably win, but I would still try static.
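
For example, something along these lines (illustrative only; the chunk sizes 64 and 16 are guesses, and the right values depend on your workload):

#include <omp.h>
#include <cstdio>
#include <vector>

// Illustrative only: the same loop under different schedules.
int main ()
{
    std::vector<double> results(100000, 0.0);
    const long n = (long) results.size();

    // static with a large chunk: each thread writes a contiguous block of
    // 'results', so two threads rarely touch the same cache line.
    #pragma omp parallel for schedule(static, 64)
    for (long i = 0; i < n; ++i)
        results[i] = 0.5 * i;

    // dynamic with a chunk larger than 1: keeps the load balancing while
    // reducing the chance of adjacent writes from different threads.
    #pragma omp parallel for schedule(dynamic, 16)
    for (long i = 0; i < n; ++i)
        results[i] += 1.0;

    std::printf("ran on up to %d threads\n", omp_get_max_threads());
    return 0;
}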

  • Thank you for your help. I played around with the schedule and I obtained the best results if I don't specify the schedule manually. In particular, the execution times now are 2.15 sec with 1 thread, 1.84 sec with 2 threads, 4.84 sec with 4 threads. However, I still observe the slow-down when using 4 threads. It is weird since with other parameter choices I do not observe this phenomenon and therefore I think that false-sharing is not the problem. Could it be a problem with the VM and the optimization flags used when compiling? – Pierpaolo Necchi Jan 27 '15 at 20:44
  • Such a dramatic slowdown from 2 to 4 threads implies those threads are getting scheduled on the same physical cores. Normally with OpenMP I would recommend looking at the processor affinities, but with a VM involved I'm not sure how much you can trust that. Ideally you could take the VM out of the loop. – user2548418 Jan 27 '15 at 23:40
  • As a longshot, I might also try moving some of your local variable declarations (target_vector, item_rating_vector, item_similarity_vector) into the parallel region. I'm not sure I would trust how well the OpenMP support can use the shared directive on STL containers. – user2548418 Jan 27 '15 at 23:43
  • I tried to move the containers inside the parallel region without improvement. I am fairly sure that this is not the problem since I use the same format in other parts of my code with normal speedups. I will try to execute the program not on the VM and see if the problem persists – Pierpaolo Necchi Jan 28 '15 at 08:53

Well, I found the solution to the problem myself, and I post the explanation for the community. In the predict_rating function I used try/catch to handle the out_of_range errors thrown by my dictionary structure when a key that is not contained in the dictionary is searched. I read in "Are exceptions in C++ really slow" that exception handling is computationally heavy when an exception is actually thrown. In my case, each call of predict_rating had multiple out_of_range errors thrown and handled. I simply removed the try/catch block and wrote a function that searches in the dictionary and returns a default value if the key does not exist. This modification produced a speedup of around 2000x, and now the program scales well with the number of threads, even on the VM. (As far as I can tell, this would also explain the slowdown with more threads: on common implementations, stack unwinding takes a global lock, so concurrent throws serialize the threads.)
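
A minimal sketch of the replacement lookup (the helper name find_or_default is just for illustration; my dictionary stores sorted (key, value) pairs, so the search is a binary search):

#include <algorithm>
#include <utility>
#include <vector>

// Sketch of the throw-free lookup that replaced the try/catch: a binary
// search on a sorted vector of (key, value) pairs that returns a default
// value instead of throwing when the key is absent.
double find_or_default (const std::vector<std::pair<int, double>> &entries,
                        int key, double default_value)
{
    auto it = std::lower_bound(entries.begin(), entries.end(), key,
                               [](const std::pair<int, double> &entry, int k)
                               { return entry.first < k; });
    return (it != entries.end() && it->first == key) ? it->second : default_value;
}

// Inside predict_rating, the old
//     try { s = item_similarity_vector[temp_item]; }
//     catch (const std::out_of_range &) { s = 0.0; }
// becomes
//     s = find_or_default(item_similarity_entries, temp_item, 0.0);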

Thanks to all of you and if you have other suggestions don't hesitate!

Pierpaolo
