1

I have the following piece of C++ code. The scale of the problem is given by N and M. Running the code takes about two minutes on my machine (after compiling with g++ -O3). Is there any way to further accelerate it on the same machine? Any kind of option is on the table: choosing a better data structure, a library, the GPU, parallelism, etc.

#include <vector>
using std::vector;

void demo() {
    int N = 1000000;
    int M = 3000;
    vector<vector<int>> res(M);
    for (int i = 0; i < N; i++) {
        for (int j = 1; j < M; j++) {
            res[j].push_back(i);
        }
    }
}


int main() {
  demo();
  return 0;
}

Additional info: the second loop above, `for (int j=1; j < M; j++)`, is a simplified version of the real problem. In fact, j could be in a different range for each i (of the outer loop), but the number of iterations is about 3000.

Just Shadow
zell
    `res[j].reserve(3000);` before the inner for loop is almost certainly going to help – john Oct 29 '20 at 07:19
  • 1
    possibly swapping the loops to make the memory access more sequential – dratenik Oct 29 '20 at 07:20
  • 1
    Are the `M` vectors truly identical, or is that just an artifact of the example? – dxiv Oct 29 '20 at 07:20
  • 1
    When handling 2D arrays it's often recommended to use a 1D array and perform the index calculation yourself. Better cache performance (or something, no expert here). – john Oct 29 '20 at 07:20
  • A single vector reserved in advance rather than a nested one will probably make a big difference – Alan Birtles Oct 29 '20 at 07:21
  • You should also make sure you have 12 GB of free memory to avoid swapping – Alan Birtles Oct 29 '20 at 07:22
  • Splitting the outer loop into threads is a possibility. And try to combine multiple different techniques. – Some programmer dude Oct 29 '20 at 07:23
  • @AlanBirtles Indeed the code takes a lot of memory. Is there any way to reduce it? – zell Oct 29 '20 at 07:40
  • @dxiv I was unclear in the question. The second loop in the code "for (int j=1; j < M; j++)" is a simplified version which should be understood as iterating approximately less than M times. See edits in the question. – zell Oct 29 '20 at 07:43
  • @dratenik What do you mean by "make the memory access more sequential"? – zell Oct 29 '20 at 07:44
  • Not really; using a single vector will use slightly less memory, but fundamentally, if you want to create 3 billion 4-byte integers you need 12 GB of memory. You could use a smaller integer type if you don't need 32 bits? – Alan Birtles Oct 29 '20 at 07:48
  • Try swapping the i and j loops. Does it run faster? (I tried, and it does speed up something like 5x for me.) That's because sequential access is faster than updating many distant memory locations at once. (insert lecture about RAM access and caches) – dratenik Oct 29 '20 at 08:02
  • @AlanBirtles The integers involved (that are push_back into the vector) are between 0 and 1 million. Maybe using unsigned int could reduce memory? – zell Oct 29 '20 at 08:03
  • no, 16 bits isn't enough, so you'll need 32 bits whether it's signed or unsigned – Alan Birtles Oct 29 '20 at 08:11

3 Answers

5

With the exact code as shown when writing this answer, you could create the inner vector once, with the specific size, and call `std::iota` to initialize it. Then just pass this vector along to the outer vector's constructor, which copies it into each element.

Then you don't need any explicit loops at all, and instead the (hopefully highly optimized) standard library does all the work for you.

Perhaps something like this:

#include <numeric>
#include <vector>

void demo()
{
    static int const N = 1000000;
    static int const M = 3000;

    std::vector<int> data(N);
    std::iota(begin(data), end(data), 0);

    std::vector<std::vector<int>> res(M, data);
}
Some programmer dude
3

Alternatively, you could initialize just one vector with those elements, and then create the other vectors by copying that memory using `std::memcpy` or `std::copy`.

Another optimization would be to allocate the memory in advance (e.g. `res[j].reserve(N)`, since each inner vector ends up holding N elements).

Also, if you're sure that all the elements of the outer vector are identical, you could do a hack: create the inner vector just once, and make the other entries of `res` refer to that same vector (e.g. via pointers) instead of storing 3000 separate copies.

Just Shadow
1

On my machine, which has enough memory to avoid swapping, your original code took 86 seconds.

Adding reserve:

    for (auto& v : res)
    {
        v.reserve(N);
    }

made basically no difference (85 seconds, though I only ran each version once).

Swapping the loop order:

    for (int j = 1; j < M; j++) {
        for (int i = 0; i < N; i++) {
            res[j].push_back(i);
        }
    }

reduced the time to 10 seconds. This is likely due to a combination of allowing the compiler to use SIMD optimisations and improving cache coherency by accessing memory in sequential order.

Creating one vector and copying it into the others:

    for (int i = 0; i < N; i++) {
        res[1].push_back(i);
    }
    for (int j = 2; j < M; j++) {
        res[j] = res[1];
    }

reduced the time to 4 seconds. Using a single vector:

void demo() {
    size_t N = 1000000;
    size_t M = 3000;
    vector<int> res(M * N);          // one flat allocation; row j lives at offset j * N
    size_t offset = N;               // row 0 stays unused, matching j starting at 1
    for (size_t i = 0; i < N; i++) {
        res[offset++] = i;           // fill row 1 once
    }
    for (size_t j = 2; j < M; j++) { // copy row 1 into rows 2..M-1
        std::copy(res.begin() + N, res.begin() + N * 2, res.begin() + offset);
        offset += N;
    }
}

also took 4 seconds. There probably isn't much improvement because you have 3,000 4 MB vectors; there would likely be more difference if N were smaller or M were larger.

Alan Birtles