Fastest way to create a vector of indices from distance matrix in C++

Question

I have a distance matrix D of size n by n and a constant L as input. I need to create a vector v contains all entries in D such that its value is at most L. Here v must be in a specific order v = [v1 v2 .. vn] where vi contains entries in ith row of D with value at most L. The order of entries in each vi is not important.

I wonder there is a fast way to create v using vector, array or any data structure + parallization. What I did is to use for loops and it is very slow for large n.

vector<int> v;
for (int i=0; i < n; ++i){
    for (int j=0; j < n; ++j){
        if (D(i,j) <= L) v.push_back(j);
    }
}

To even consider improving your solution, we would need some basic information about how the contents of `D` were calculated in the first place. As it stands, you are comparing each element to `L` only once, which I would consider to be the absolute minimum. To improve on that, you would need to be able to know that some particular value automatically means some range can be excluded from testing (_i.e._ dynamic programming). — paddy, Oct 10 '19 at 23:45
One thing to consider is that the memory layout of `D` and/or your specific access pattern might not be cache-friendly. Cache misses on traversing large data sets can become huge bottlenecks. — paddy, Oct 10 '19 at 23:48
'D' is a matrix of 'int' type stored using eigen library. 'D' is read from a file and it's fast. I have a huge memory and I can store a matrix D up to n = 200.000 — Thomas Edison, Oct 11 '19 at 09:11
Okay, if you don't want to provide actual information, that's up to you. Bear in mind that having "huge memory" does not help one bit when it comes to L1/L2 cache in your processor and it certainly is no excuse for ignoring your program's memory layout and access patterns, especially when you are seeking advice on performance issues. — paddy, Oct 11 '19 at 09:30
This might be relevant reading for you: [Most efficient way to loop through an Eigen matrix](https://stackoverflow.com/q/16283000/1553090) — paddy, Oct 11 '19 at 10:04

Giuliano · Answer 1 · 2019-10-15T16:10:44.683

In consideration of the comments, I made the appropriate corrections (in emphasis).

Have you searched tips for writing performance code, threading, asm instructions (if your assembly is not exactly what you want) and OpenCL for parallel-processing? If not, I strongly recommend!

In some cases, declaring all for loop variables out of the for loop (to avoid declaring they a lot of times) will make it faster, but not in this case (comment from our friend Paddy).

Also, using new insted of vector can be faster, as we see here: Using arrays or std::vectors in C++, what's the performance gap? - and I tested, and with vector it's 6 seconds slower than with new,which only takes 1 second. I guess that the safety and ease of management guarantees that come with std::vector is not desired when someone is searching for performance, even because using new is not so difficult, just avoid heap overflow with calculations and remember using delete[]

user4581301 is correct here, and the following statement is untrue: Finally, if you build D in a array instead of matrix (or maybe if you copy D into a constant array, maybe...), it will be much mor cache-friendly and will save one for loop statement.

The initialization done by `vector` doesn't slow the program down by much, and you're trading off the safety and ease of management guarantees that come with `std::vector`. Don't do this lightly. Make sure that it really is necessary with profiling and after you've done it, profile again to make sure that it really does make a difference. — user4581301, Oct 11 '19 at 00:21
Sidenote: The `D(i,j)` notation suggests that the lessons of using a 1D array for a matrix may already have been taken. — user4581301, Oct 11 '19 at 00:22
This statement is completely untrue: _"declaring all for loop variables out of the for loop ... will make it faster"_. Well, let me qualify that.. There are scenarios where it is true, but _not_ in the case of the example presented in this question. — paddy, Oct 11 '19 at 01:16

score 1 · Answer 2 · answered Oct 13 '19 at 12:59

The best way is mostly depending on the context. If you are seeking for GPU parallization you should take a look at OpenCL.

For CPU based parallization the C++ standard #include <thread> library is probably your best bet, but you need to be careful:

Threads take time to create so if n is relatively small (<1000 or so) it will slow you down
D(i,j) has to be readably by multiple threads at the same time
v has to be writable by multiple threads, a standard vector wont cut it

v may be a 2d vector with vi as its subvectors, but these have to be initialized before the parallization:

std::vector<std::vector<int>> v; 
v.reserve(n);                    
for(size_t i = 0; i < n; i++)
{
    v.push_back(std::vector<int>());
}

You need to decide how many threads you want to use. If this is for one machine only, hardcoding is a valid option. There is a function in the thread library that gets the amount of supported threads, but it is more of a hint than trustworthy.

size_t threadAmount = std::thread::hardware_concurrency(); //How many threads should run hardware_concurrency() gives you a hint, but its not optimal
std::vector<std::thread> t;                                //to store the threads in
t.reserve(threadAmount-1);                                 //you need threadAmount-1 extra threads (we already have the main-thread)

To start a thread you need a function it can execute. In this case this is to read through part of your matrix.

void CheckPart(size_t start, size_t amount, int L, std::vector<std::vector<int>>& vec)
{
    for(size_t i = start; i < amount+start; i++)
    {
        for(size_t j = 0; j < n; j++)
        {
            if(D(i,j) <= L)
            {
                vec[i].push_back(j);
            }
        }
    }
}

Now you need to split your matrix in parts of about n/threadAmount rows and start the threads. The thread constructor needs a function and its parameter, but it will always try to copy the parameters, even if the function wants a reference. To prevent this, you need to force using a reference with std::ref()

int i = 0;
int rows;
for(size_t a = 0; a < threadAmount-1; a++)
{
    rows = n/threadAmount + ((n%threadAmount>a)?1:0);
    t.push_back(std::thread(CheckPart, i, rows, L, std::ref(v)));
    i += rows;
}

The threads are now running and all there is to do is run the last block on the main function:

SortPart(i, n/threadAmount, L, v);

After that you need to wait for the threads finishing and clean them up:

for(unsigned int a = 0; a < threadAmount-1; a++)
{
    if(t[a].joinable())
    {
        t[a].join();
    }
}

Please note that this is just a quick and dirty example. Different problems might need different implementation, and since I can't guess the context the help I can give is rather limited.

Fastest way to create a vector of indices from distance matrix in C++

2 Answers2