Strange performance problem

Question

I have a container similar to this one.

template <typename Nat, typename Elt>
class NatMap {
 public:
  Elt& operator[] (Nat nat) { 
    return tab [nat.GetRaw()];
  }
 private:
  Elt tab [Nat::kBound];
};

I wanted to drop the requirement for Elt to have a default constructor:

template <typename Nat, typename Elt>
class NatMap {
 public:
  Elt& operator[] (Nat nat) { 
    return ((Elt*)tab) [nat.GetRaw()];
  }
 private:
  char tab [Nat::kBound * sizeof(Elt)];
};

I use g++-4.3 and this code works 25% slower in my application than the previous one. Unfortunately the slowdown does not manifest in a synthetic benchmark. I guess it is something about compiler optimizations, aliasing, aligning, or similar stuff.

What should I do to get my performance back? (while not needing the default constructor)

Update:

Just now I tried new g++-4.4 and it gave me a following warning for the latter code:

dereferencing pointer '<anonymous>' does break strict-aliasing rules

Are you using `placement new` or some other ugly hack? Not wanting a default ctor can be worked around by defining at least one ctor. — dirkgently, Oct 30 '09 at 14:23
Placement new is not a ugly hack, but a essential language feature. Usually it is nicely wrapped in some container like vector, so people think it is obscure. Please don't do that, it's nothing more as a way to call a constructor of existing object. — Łukasz Lew, Oct 30 '09 at 14:30
And to answer your question, I do not use placement new, but I will as I have to implement a container that support objects that have only named constructors in public. — Łukasz Lew, Oct 30 '09 at 14:32
Note: The container has to by POD'ish enough so I can use memcpy on objects containing it. — Łukasz Lew, Oct 30 '09 at 14:46
Here's a very helpful answer. http://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule — P Shved, Oct 30 '09 at 14:54
As for `placement new` I'd rather agree to disagree on the usefulness and concomitant constraints and part ;-) — dirkgently, Oct 30 '09 at 14:57
@dirkgently: ;), "... warns about code which might break the strict aliasing rules that the compiler is using for optimization." — Łukasz Lew, Oct 30 '09 at 15:08

score 1 · Answer 1 · answered Nov 03 '09 at 02:43

You may be running into alignment problems. If Elt is some size other than the native alignment type, then allocating it via placement into a character array may involve a lot of unaligned reads that you don't see when the compiler aligns it for you. Or you may be running into a problem called a load-hit-store, which some processors manifest when they write a value to memory and then read it back immediately; in those processors, it can be a stall as long as a pipeline.

Or it may be something else entirely, some kind of pathological code generation by GCC.

Unfortunately stack traces don't help track down either of these issues, as they'd just look like a load operation (lw, lb, etc) that took forty cycles instead of one. The stall is in the microcode inside the CPU, not the x86 code you've written. But looking at the assembly with the -S commandline option can help you figure out what the compiler is really emitting, and how it differs between your two implementations. Maybe there's some bad operation cropping up in one version.

I will look into that, but Elt is almost always int-sized. __atributte__ ((aligin)) doesn't help either. — Łukasz Lew, Nov 03 '09 at 13:26
attribute(align) would be foiled by placement news into a char array; after all, you could tell it to place upon any arbitrary address. — Crashworks, Nov 03 '09 at 19:33

score 0 · Answer 2 · edited May 23 '17 at 10:33

0

Small suggestion: rather than trying to make educated guesses, like if the compiler optimizations are different, you could either single-step it, or find out with this unorthodox method.

edited May 23 '17 at 10:33

Community

1
1

answered Oct 30 '09 at 16:22

Mike Dunlavey

40,059
14
91
135

Mike, I'm not searching for a slow part of the algorithm. Array access is essential and I want to know why in the case above I get 25% slower binary. – Łukasz Lew Oct 30 '09 at 19:12
In that case, single stepping each program at the instruction level is a simple way to find out. They are doing different things. Sampling will also tell you. Maybe I misunderstood you. Maybe you know what they are doing different, and you only want to know why. – Mike Dunlavey Oct 30 '09 at 22:12
... In fact, sometimes I step two programs side-by-side, in two instances of the IDE/debugger, so I can see just how they differ. It may be a little slow, but what the heck, so am I :-) – Mike Dunlavey Oct 30 '09 at 22:19
Look at the source, the programs are identical at the source level (in terms of step-by-step debugger walking) - even side by side execution will be identical. Only assembler would help, but this is impossible, as changing one line in code affects all the important parts of assembly output in chaotic manner due to whole program optimizations. So it won't help either. Running assembly side by side ... maybe, but how? even gdb can't tell which source line is responsible for which assembly line for 70% of the source. I know what I'm saying, I've been there. – Łukasz Lew Oct 30 '09 at 23:55
That's what I'm saying. Write two small apps, one one way, and one the other, so as to remove all other variables. The source is not identical. Then step at the assembly level. If gdb can't show the source line as well as the disassembly, get a mixed source-assembly listing and follow along in that. It's not chaotic. If the compiler scrambles the code too much, turn down the optimization. You will be able to understand what the compiler did. I've been there too. – Mike Dunlavey Oct 31 '09 at 00:19
This kind of question can only really be answered by doing an experiment. If anyone else can do it, you can too. – Mike Dunlavey Oct 31 '09 at 00:43
As I mentioned, this drop doesn't show up in small custom benchmark application. And reducing from -O3 to -O2 drops 80% of performance (mostly due to inlineing) – Łukasz Lew Oct 31 '09 at 23:52
Well, it may be a tough one, but I've never seen a question like this that one couldn't get to the bottom of. As I mentioned, if instruction stepping is a problem, I just take stackshots. 20 should be enough. The slow one should show about 4 samples doing something that the fast one is not doing. Then I just make sure I understand exactly what that is. Good luck. – Mike Dunlavey Nov 01 '09 at 14:16

Strange performance problem

2 Answers2