I was watching a talk, "Efficiency with Algorithms, Performance with Data Structures", and was surprised by the comment that in:

#include <string>
#include <unordered_map>
#include <memory>

struct Foo {
  int x;
};

Foo* getFoo(std::string key,
            std::unordered_map<std::string,
                               std::unique_ptr<Foo>> &cache) {
  if (cache[key])
    return cache[key].get();

  cache[key] = std::unique_ptr<Foo>(new Foo());
  return cache[key].get();
}

Foo* getFooBetter(std::string key,
                  std::unordered_map<std::string,
                                     std::unique_ptr<Foo>> &cache) {
  std::unique_ptr<Foo> &entry = cache[key];
  if (entry)
    return entry.get();

  entry = std::unique_ptr<Foo>(new Foo());
  return entry.get();
}

getFooBetter() is better. I had been under the belief that I could rely on the compiler to perform this sort of transformation, in the same way that I'd expect multiple occurrences of x+y to be evaluated only once. Unsurprisingly, the generated LLVM IR indeed agrees with the presenter. Even with -O9, we're left with 3 calls to cache[key] in the getFoo() version.
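
For instance, in something contrived like this (my own example, not from the talk), I'd expect the compiler to compute x + y once and reuse it:

int sum_and_double(int x, int y) {
  int a = x + y;        // computed once
  int b = (x + y) * 2;  // reuses the already-computed x + y
  return a + b;
}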

I've moved the long LLVM IR of both versions, with the C++ symbols demangled, out of line so as not to be visually offensive.

Another StackOverflow question reveals that part of the answer here is that operator[] is assumed to be able to modify whatever global state it wishes, and thus we can't elide calls. A linked proposal about introducing a [[pure]] annotation talks about its applications to CSE.
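
GCC and Clang already ship a non-standard cousin of that annotation; as a rough sketch of the idea (lookup is a hypothetical function, not from the proposal):

// The pure attribute promises no side effects (the function may read, but
// never write, global state), which licenses the compiler to merge calls.
int lookup(int key) __attribute__((pure));

int twice_looked_up(int key) {
  // With the attribute, these two calls may be merged into one;
  // without it, the compiler must emit both.
  return lookup(key) + lookup(key);
}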

If we had stayed at 4 calls, I could end here feeling satisfied. However, if my reading of the IR is correct, it looks like getFoo() was optimized as if we had written:

Foo* getFoo(std::string key,
            std::unordered_map<std::string,
                               std::unique_ptr<Foo>> &cache) {
  if (cache[key])
    return cache[key].get();

  std::unique_ptr<Foo> &entry = cache[key];
  entry = std::unique_ptr<Foo>(new Foo());
  return entry.get();
}

Would someone be able to explain what clang's view of the code is such that it was able to merge the last two cache[key]s, but not all of them? (My local clang is 3.4.)

Alex Miller

    Wild guess (I don't know Clang internals): it might be a result of inline expansion followed by further optimizations; in this case, after expansion, the compiler does have enough information about the exact nature of the expanded fragments, and can eliminate duplicates without the risk of unintended side effects and without [[pure]]. Still, relying on any kind of optimization being done by the compiler is generally a Bad Idea. – No-Bugs Hare May 29 '15 at 04:51

2 Answers

The CSE implementation in LLVM operates on arithmetic expressions. You can see LLVM's common subexpression elimination source code in llvm/lib/Transforms/Scalar/EarlyCSE.cpp.

The case we face here is one of interprocedural optimization.

The call cache[key] is really a call to the member function operator[], so optimizations such as inlining may come into play, depending on the cost of inlining that operator[]. Chandler mentioned the same issue: when the hash function is computationally expensive, inlining is prevented, and one ends up computing the hash function more than once!
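
That is, the subscript is just sugar for an ordinary member function call (a sketch reusing the question's Foo and includes; desugared is an illustrative name):

Foo* desugared(std::string key,
               std::unordered_map<std::string,
                                  std::unique_ptr<Foo>> &cache) {
  // Exactly what cache[key].get() means: a call to the member function
  // operator[], which the optimizer may or may not be able to inline.
  return cache.operator[](key).get();
}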

If inlining does happen, then in the IR at -O3, cache[key] is computed first, and given that neither cache nor key has mutated at all, a later identical call will be optimized to the same SSA value.

In the case of cache[key].get(), the IR we would normally write has cache[key] return the object, and then fetches the field inside get() using a getelementptr. With optimization turned on, this IR turns into our previously calculated SSA value for cache[key], with the element access into the unique_ptr struct.
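
For intuition (a toy model; the real std::unique_ptr does much more, with deleters and ownership semantics), get() is nothing but a field access, which is why it folds into the already-computed SSA value so easily:

template <typename T>
struct toy_unique_ptr {
  T *ptr;
  // A single field access: a getelementptr plus a load in LLVM IR.
  T *get() const { return ptr; }
};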

Coming back to getFooBetter(): in the worst case, if the compiler fails to optimize across procedures, more occurrences of cache[key] will result in more computation, and such calls will appear as written even at -O3!

Mahesh Attarde

There is a lot of stuff going on in an unordered_map lookup. There are hash computations, searching through a bin, adding to the bin if it wasn't there, maybe growing the table if it is now too big. It is not the same as noticing that you have two instances of "x+y" in your code. You should be more amazed it actually discovered that two of the calls can be merged. (I am.)
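
For a feel of how much work hides in one lookup, here is a toy separate-chaining map (purely illustrative; libc++ and libstdc++ differ in many details):

#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

struct ToyMap {
  std::vector<std::list<std::pair<std::string, int>>> buckets;
  ToyMap() : buckets(8) {}

  int &operator[](const std::string &key) {
    std::size_t h = std::hash<std::string>{}(key);  // hash computation
    auto &bucket = buckets[h % buckets.size()];
    for (auto &entry : bucket)                      // search through the bin
      if (entry.first == key)
        return entry.second;
    // A real map would also track the load factor here and grow the
    // table (rehash) when it gets too big.
    bucket.emplace_back(key, 0);                    // add if it wasn't there
    return bucket.back().second;
  }
};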

As a general rule, I would not count on a compiler discovering that two function calls can be shared, and when performance matters, I would do common subexpression elimination in the source myself. I wouldn't even expect it to spot that two sin(x) calls are the same until constexpr is fully implemented.
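
In source, that just means hoisting the repeated call into a local yourself:

#include <cmath>

double two_sines(double x) {
  double s = std::sin(x);  // evaluated once, by hand
  return s + s;            // rather than std::sin(x) + std::sin(x)
}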

CHKingsley