In his talk "Efficiency with algorithms, Performance with data structures", Chandler Carruth talks about the need for a better allocator model in C++. The current allocator model invades the type system, which makes it almost impossible to use in many projects. The Bloomberg allocator model, on the other hand, does not invade the type system, but it is based on virtual function calls, which makes it impossible for the compiler to 'see' the allocation and optimize it. In the talk he also mentions compilers deduplicating memory allocations (at 1:06:47).
It took me some time to find examples of memory allocation optimization, but I found this code sample, which, compiled under clang, optimizes away all the memory allocations and just returns 1000000 without allocating anything:
template<typename T>
T* create() { return new T(); }

int main() {
    auto result = 0;
    for (auto i = 0; i < 1000000; ++i) {
        result += (create<int>() != nullptr);
    }
    return result;
}
The following paper, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3664.html, also says that allocations could be fused by compilers and seems to suggest that some compilers already do that sort of thing.
As I am very interested in strategies for efficient memory allocation, I really want to understand why Chandler Carruth is against virtual calls in the Bloomberg model. The example above clearly shows that clang can optimize allocations away when it can see them.
- I would like to see "real life" code where this optimization is useful and performed by a current compiler.
- Do you have any example of code where different allocations are fused by any current compiler?
- Do you understand what Chandler Carruth means when he says that compilers can "deduplicate" your allocations (at 1:06:47 in his talk)?