
Basically, I have a collection `std::vector<std::pair<std::vector<float>, unsigned int>>` which contains pairs of a template (`std::vector<float>` of size 512, i.e. 2048 bytes) and its corresponding identifier (`unsigned int`).

I am writing a function that is given a template and must return the identifier of the most similar template in the collection. I am using the dot product to compute the similarity.

My naive implementation looks as follows:

// Should return false if no match is found (ie. similarity is 0 for all templates in collection)
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    bool found = false;
    similarity = 0.f;

    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float cosineSimilarity = getSimilarity(data, candidateTemplate, length); // computes cosine similarity between two vectors; implementation depends on the architecture

        if (cosineSimilarity > similarity) {
            found = true;
            similarity = cosineSimilarity;
            label = collection[i].second;
        }
    }

    return found;
}

How can I speed this up using parallelization? My collection can potentially contain millions of templates. I have read that you can add `#pragma omp parallel for reduction`, but I am not entirely sure how to use it (or if this is even the best option).

Also note: for my dot product implementation, if the base architecture supports AVX & FMA, I am using this implementation. Will this affect performance when we parallelize, since there are only a limited number of SIMD registers?
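For reference, here is a scalar sketch of what `getSimilarity` computes (assuming the templates are L2-normalized, so cosine similarity reduces to a dot product; the AVX & FMA version vectorizes this same loop):

```cpp
#include <cstddef>

// Scalar sketch of getSimilarity: assuming the templates are L2-normalized,
// cosine similarity reduces to a plain dot product. The AVX & FMA version
// vectorizes this same loop (compilers will often do so automatically with
// -mavx2 -mfma or /arch:AVX2).
float getSimilarity(const float* a, const float* b, unsigned int length) {
    float sum = 0.f;
    for (unsigned int i = 0; i < length; ++i)
        sum += a[i] * b[i];
    return sum;
}
```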

cyrusbehr

1 Answer


Since we don't have access to an example that actually compiles (which would have been nice), I didn't actually try to compile the code below. Nevertheless, minor typos aside, the general idea should be clear.

The task is to find the highest similarity together with the corresponding label. We can indeed use a reduction for this, but since we need to maximize one value while carrying the corresponding label along, we store both in a `std::pair` and declare a custom reduction in OpenMP.

I have slightly rewritten your code; keeping the original variable name (`temp`) may make it a bit harder to read. Basically, we perform the search in parallel: each thread finds its own best candidate, then we ask OpenMP to combine the per-thread results (the reduction) and we are done.

// Reduce by finding the maximum similarity while keeping the corresponding
// label; this is why we use a std::pair.
void reduce_custom(std::pair<float, unsigned int>& output,
                   const std::pair<float, unsigned int>& input) {
    if (input.first > output.first) output = input;
}

// Declare an OpenMP reduction over our pair using the custom combiner above.
// Each thread's private copy is initialized from the original value.
#pragma omp declare reduction(custom_reduction : \
    std::pair<float, unsigned int>: \
    reduce_custom(omp_out, omp_in)) \
    initializer(omp_priv(omp_orig))

bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    std::pair<float, unsigned int> temp(0.f, label); // stores the thread-local best similarity and its corresponding label

#pragma omp parallel for reduction(custom_reduction:temp)
    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data(); 
        float cosineSimilarity = getSimilarity(data, candidateTemplate, length);

        if (cosineSimilarity > temp.first) {
            temp.first = cosineSimilarity;
            temp.second = collection[i].second;
        }
    }

    if (temp.first > 0.f) {
        similarity = temp.first;
        label = temp.second;
        return true;
    }

    return false;
}

Regarding your concern about the limited number of SIMD registers: how many there are depends on the specific CPU you are using. To the best of my understanding, each core has its own set of vector registers, so as long as you were not using more than were available before, you should be fine after parallelizing as well. Besides, AVX-512, for instance, provides 32 vector registers and two arithmetic units for vector operations per core, so running out of compute resources is not trivial. You are more likely to suffer from poor memory locality, particularly in your case, with the template vectors allocated all over the heap. I might of course be wrong; if so, please feel free to correct me in the comments.
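To illustrate the locality point: instead of one heap allocation per template, you could keep all templates back-to-back in a single buffer so that the scan in `identify` walks memory linearly. A sketch (the names here are made up):

```cpp
#include <cstddef>
#include <vector>

// Sketch of a contiguous layout: all templates live back-to-back in one
// buffer, so iterating over them touches memory sequentially instead of
// chasing a separate heap allocation per std::vector<float>.
struct TemplateStore {
    std::size_t dim;                  // floats per template, e.g. 512
    std::vector<float> data;          // dim * size() floats, back-to-back
    std::vector<unsigned int> labels; // labels[i] belongs to template i

    explicit TemplateStore(std::size_t d) : dim(d) {}

    void add(const float* t, unsigned int label) {
        data.insert(data.end(), t, t + dim);
        labels.push_back(label);
    }
    const float* get(std::size_t i) const { return data.data() + i * dim; }
    std::size_t size() const { return labels.size(); }
};
```

The parallel loop would then read `store.get(i)` and `store.labels[i]` instead of `collection[i].first.data()` and `collection[i].second`.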

Qubit
  • Thank you for the answer Qubit, this looks great. Also you are correct, it should be a pointer `const float* candidateTemplate`. I've edited my question to reflect this change. – cyrusbehr Feb 03 '20 at 18:16
  • Perhaps you could kindly answer one more question. How many threads will the `#pragma omp parallel` directive launch? Is it dependent on the system? – cyrusbehr Feb 03 '20 at 18:28
  • Typically it will launch as many threads as there are logical processors available on that system (that is, as many as there are with no regard whether or not they are occupied). You will find that in some applications running fewer threads may be beneficial (i.e. hyperthreading is not always your friend) - or you may just want to use part of your resources for the task. You can control the number of threads by adding `num_threads(some_value_here)` to the omp pragma directive. You should also consider adding `schedule(static)`, which I think makes most sense in your case. – Qubit Feb 04 '20 at 10:58
  • Do I need to use the `private` clause so that each thread has its own copy of the `temp` variable? e.g. `#pragma omp parallel for reduction(custom_reduction:temp) private(temp)` – cyrusbehr Apr 20 '20 at 23:38
  • 1
    No, internally OpenMP will create a thread private variable and the merge them in the end, both of these will be done according to how you specify your reduction clause. – Qubit Apr 22 '20 at 00:50