
I ran into an issue while trying to understand the difference (in computational time) between re-allocating a structure each time it is needed versus allocating it once up front and then re-filling it (a sort of reset to default values).

On Windows and Ubuntu (WSL) I get a similar result, i.e., much more time when I RE-allocate; on Mac things change.

Here's the code

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <limits>
#include <vector>

using namespace std;
// Despite the name, returns elapsed seconds (milliseconds / 1000).
template<typename T>
inline double getMs(T start, T end) {
    return double(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
        .count()) /
        1000;
}

// Allocate all structures once, then repeatedly reset their contents.
void pre_allocation(int size, int max_k) {
    vector<int> tau_star(size, 1440 * 3);
    tau_star.shrink_to_fit();

    vector<vector<int>> tau;
    tau.reserve(max_k);

    vector<vector<int>> pred_trip;
    pred_trip.reserve(max_k);

    vector<vector<int>> pred_stop;
    pred_stop.reserve(max_k);

    for (int k = 0; k < max_k; ++k) {
        pred_trip.emplace_back(size, -1);
        pred_stop.emplace_back(size, std::numeric_limits<int>::max()); // size_t max here would narrow to -1 in a vector<int>
        tau.emplace_back(size, 1440 * 3);
    }

    for (int i = 0; i < size; ++i) {
        std::fill(tau_star.begin(), tau_star.end(), 1440 * 3);
        for (int k = 0; k < max_k; ++k) {
            std::fill(tau[k].begin(), tau[k].end(), 1440 * 3);
            std::fill(pred_trip[k].begin(), pred_trip[k].end(), -1);
            std::fill(pred_stop[k].begin(), pred_stop[k].end(), std::numeric_limits<int>::max());
        }
    }

}

// Allocate (and destroy) all structures from scratch on every iteration.
void re_allocation(int size, int max_k) {
    for (int i = 0; i < size; ++i) {
        vector<int> tau_star(size, 1440 * 3);
        tau_star.shrink_to_fit();

        vector<vector<int>> tau;
        tau.reserve(max_k);


        vector<vector<int>> pred_trip;
        pred_trip.reserve(max_k);

        vector<vector<int>> pred_stop;
        pred_stop.reserve(max_k);

        for (int k = 0; k < max_k; ++k) {
            pred_trip.emplace_back(size, -1);
            pred_stop.emplace_back(size, std::numeric_limits<int>::max());
            tau.emplace_back(size, 1440 * 3);
        }
    }
}


int main() {

    int size = 107333;
    int max_k = 3;

    auto start_pre_alloc = std::chrono::high_resolution_clock::now();
    pre_allocation(size, max_k);
    double elapsed_pre_alloc = getMs(start_pre_alloc, std::chrono::high_resolution_clock::now());

    auto start_re_alloc = std::chrono::high_resolution_clock::now();
    re_allocation(size, max_k);
    double elapsed_re_alloc = getMs(start_re_alloc, std::chrono::high_resolution_clock::now());

    printf("Time in pre-allocation: %.3f sec\n", elapsed_pre_alloc);
    printf("Time in RE-allocation: %.3f sec\n", elapsed_re_alloc);

    return 0;
}
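
A side note on the timing helper: getMs truncates to whole milliseconds before dividing, which is harmless at this seconds scale but drops sub-millisecond precision. A possible alternative using chrono's floating-point durations (a sketch; getSeconds is just a name I picked):

#include <chrono>

// Elapsed seconds as a double, without first truncating to milliseconds.
template <typename T>
inline double getSeconds(T start, T end) {
    return std::chrono::duration<double>(end - start).count();
}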

These structures are actually used in a larger piece of software, but I needed a small example to understand what happens.

The results are:

Windows:

Time in pre-allocation: 11.617 sec
Time in RE-allocation: 53.679 sec

Ubuntu (WSL):

Time in pre-allocation: 15.749 sec
Time in RE-allocation: 81.905 sec

Mac:

Time in pre-allocation: 9.396 sec
Time in RE-allocation: 12.408 sec

The specifics of my Windows machine are:

  • CPU - 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
  • RAM - 16 GB DDR4
  • OS/Compiler - Windows 11, MS Visual Studio 2022

The Mac is a MacBook Pro 2018:

  • CPU - 6-core Intel Core i9 @ 2.9 GHz
  • RAM - 16GB 2400 MHz DDR4
  • macOS Big Sur Version 11.6.5

On Windows I compile with Visual Studio, while on Ubuntu and Mac I use the following Makefile:

# Directory for my files
MYHOME          = ${PWD}
BIN             = ${MYHOME}/bin
LIB             = ${MYHOME}/lib
SRC             = ${MYHOME}

# For Linux:
# OPTFLAG = -O2 -ffast-math -march=native -DNDEBUG -Wall -std=c++17 -DLINUX

# For Mac:
OPTFLAG = -O2 -ffast-math -DNDEBUG -Wall -std=c++17 -DLINUX
LDFLAGS = -O2 -DNDEBUG -lm -pthread -std=c++17

COMPILER    = g++ ${OPTFLAG}
LINKER      = g++ ${LDFLAGS}


# Directory for output files
OUT_DIR=bin lib


# Command line tool
cli: ${OUT_DIR} ${SRC}/main.cpp
        ${COMPILER} -c ${SRC}/main.cpp -o ${LIB}/main.o
        ${LINKER} -o ${BIN}/main ${LIB}/main.o

Is it possible that the compiler on Mac understands that it can allocate only once and just re-fill? Why does this happen?
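
To test this hypothesis I could count the real allocations at run time, e.g. by replacing the global `operator new` in the same file. This is only a quick sketch (the counter name `g_allocs` is mine), but the replacement mechanism itself is standard C++:

#include <cstdio>
#include <cstdlib>
#include <new>

static std::size_t g_allocs = 0;

// Count every allocation that goes through the global operator new.
void* operator new(std::size_t n) {
    ++g_allocs;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

Then main() would bracket each call:

    std::size_t before = g_allocs;
    re_allocation(size, max_k);
    printf("re_allocation: %zu allocations\n", g_allocs - before);

If the Mac compiler really hoists or elides the per-iteration allocations, the count for re_allocation should come out far below the expected size * (4 + 3 * max_k): one vector for tau_star, three reserve calls, and 3 * max_k inner vectors per iteration.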

  • By "things change" you mean "both versions are faster but the reallocating version is still slowest", right? So are you just asking why the penalty of redundant dynamic allocation is lower on Mac than on Windows? Or are you asking why both versions are faster on Mac? – Useless May 03 '23 at 09:30
  • `tau_star.shrink_to_fit();` doesn't make much sense. And your two functions seems to do two different things. – Some programmer dude May 03 '23 at 09:31
  • @Useless Yes, that is the thing I do not understand: how can the time for allocating each time be so close to the time for just filling? – Claudio Tomasi May 03 '23 at 09:43
  • @Someprogrammerdude One function first allocates and then refills the structures in a loop. The other keeps allocating them (overwriting them) at each iteration. – Claudio Tomasi May 03 '23 at 09:53
  • 2
    Note that both functions `pre_allocation` and `re_allocation` have no observable behavior. Some of their parts may be then optimized away by compilers. – Daniel Langr May 03 '23 at 09:56
  • Exactly, so my question is: why do Windows and Ubuntu keep the same 'behaviour' while Mac optimizes the re-allocation part so much? – Claudio Tomasi May 03 '23 at 09:57
  • Well the obvious answer is: Different compilers. Unless you have explicitly installed GCC on your macOS system, `g++` is an alias for `clang++`, which is the Apple compiler. Even if you have the same compiler, differences between versions might matter as well. – Some programmer dude May 03 '23 at 10:25
  • And if you really want to see the difference, look at the generated code. I recommend [the compiler explorer](https://godbolt.org) where you can change compilers and flags, and have multiple compilers side by side to see the differences in their generated code. – Some programmer dude May 03 '23 at 10:26
  • On Ubuntu I installed and compiled with clang++ instead of g++ and the results are similar: Time in pre-allocation: 9.115 sec, Time in RE-allocation: 69.417 sec – Claudio Tomasi May 03 '23 at 12:16
  • What do you mean "similar"? The pre-alloc version is now similar to the Mac version but the re-alloc version is still closer to the GCC WSL build. I still don't know what _specific_ observation you're confused by, because you answered my earlier choice of two possible questions with "yes". Please tell us what you EXPECTED to see instead, and why this is surprising to you. – Useless May 03 '23 at 12:42
  • Regarding before using clang on Ubuntu: I would expect the same thing to happen on all 3 OSes. Allocation costs more than writing, so I am fine with what happens on Windows and Ubuntu; I do not understand why the gap on the Mac between filling and allocating is so small. Regarding after using clang on Ubuntu: if the difference comes from the clang++ used on the Mac, as opposed to the g++ used on Windows and Ubuntu, then why does clang++ on Ubuntu give the same result as before, instead of being close to the Mac? – Claudio Tomasi May 03 '23 at 12:47
  • So you're asking about the fine details of _how much_ slower repeated dynamic allocation _should be_ than reuse. That's a question about the quality of implementation of the standard library, the runtime library (libc `malloc` implementation etc.), and the kernel/syscall interface (probably just for the first `sbrk`). – Useless May 03 '23 at 13:16
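
Following the comment above about observable behavior, a minimal sketch of how the benchmark could be made optimizer-proof: fold each filled vector into a checksum that main() prints. The `checksum` helper below is illustrative, not part of the original code:

#include <numeric>
#include <vector>

// Fold a vector into a value the caller must keep, so the optimizer
// cannot discard the fills and allocations that produced it.
template <typename T>
long long checksum(const std::vector<T>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

Each benchmark function would then return, e.g., checksum(tau_star) + checksum(tau[0]) + checksum(pred_trip[max_k - 1]), and main() would printf those sums next to the timings, so neither version's work can be discarded as dead code.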

0 Answers