
I have a strange problem, and maybe one of you has an idea what is going on here.

The code I'm working on is a longish and complex simulation code. It has a function matrixSetup which is called at the beginning and whose runtime I measure. After setting up my matrix and doing a lot of other work, I run my solver and so on.

Now I changed something in my solver code that should not influence the runtime of the matrix setup. However, I see an increase there from 90 to 150 seconds, without touching that piece of code. Why? How?

This time difference is fully reproducible. Undoing the change in the solver brings back the fast matrixSetup. Other changes in the solver may or may not lead to the same increase in runtime, all of it reproducible. The runs were carried out in isolation on an otherwise empty compute node, so there is no interference from other jobs.

Using VTune to find out where the increase in runtime occurs, I end up at a simple loop (inside a loop nest):

for (l = 0; l < nrConnects; l++)
    if (connectedPartitions[l] == otherParti) {
        nrCommonCouplNodes[l]++; 
        pos = l; 
        break;
    }

Does anybody have an idea what is going on there? The compiler-generated instructions are exactly the same according to VTune. I'm using the Intel compiler, version 19.0.1.

I was playing around with compiler flags a little bit. When adding -fpic (which determines whether the compiler generates position-independent code), the increase in runtime is gone. But I assume this just produces slightly different instructions and hence does not cure the real problem I'm facing.
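
For illustration, the experiment boils down to something like this (icc as the compiler driver, the file name and -O2 are just placeholders; the real build of course uses many more flags):

# hypothetical compile lines, not my actual build
icc -O2 -c matrixSetup.c -o matrixSetup.o          # matrixSetup takes ~150 s after the solver change
icc -O2 -fpic -c matrixSetup.c -o matrixSetup.o    # with -fpic the slowdown is gone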

With Clang, I do not see this behaviour (at least not here)...

Any ideas on the reason for the increased runtime? I'm very curious...

Cheers, Michael

  • Seems weird. Maybe that code is landing on some sort of memory boundary, be it a cache line or a page or something, not sure. Have you looked to see if the generated assembly for that part is the same? Is one of the variables in that part statically allocated? If so, it may have landed in an unfortunate place. You could also try to make the linker place specific variables/functions at fixed addresses to exclude that possibility. – PeterT Aug 20 '21 at 09:35
  • Could maybe be branch-predictor aliasing (which depends on the relative alignment of two hot branches in separate parts of the code). Or possibly [32-byte aligned routine does not fit the uops cache](https://stackoverflow.com/a/61016915) - the JCC erratum, if ICC doesn't intentionally work around that performance pothole on Skylake CPUs with updated microcode. That depends on where your branch instructions land relative to a 32-byte boundary. – Peter Cordes Aug 20 '21 at 12:18
  • I think there are a lot of possible effects that could be the source of the issue (compiler heuristics, memory alignment, cache effects, inlining, prediction issues, etc.). Providing a *minimal reproducible/working example* would help us (and so you) a lot to understand. I know this is not easy in your case, but for now we can only make some vague speculation. – Jérôme Richard Aug 20 '21 at 18:30

1 Answer


I have now tested with -mbranches-within-32B-boundaries, and the same code gets the fast runtime again (the code is also faster in some other places now). The flag is proposed in a document by Intel about the JCC erratum. Thanks to Peter Cordes for pointing me to that. Hopefully it is not only fighting the symptoms but really curing the problem.
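
For reference, using the flag just means adding it to the compile line, roughly like this (icc, the file name and -O2 are placeholders, not my actual build line):

# hypothetical compile line with the branch-alignment workaround enabled
icc -O2 -mbranches-within-32B-boundaries -c matrixSetup.c -o matrixSetup.o

As I understand it, the flag makes the compiler pad the code so that jump instructions (and macro-fused compare-and-jump pairs) no longer cross or end on a 32-byte boundary, which is exactly the pattern the post-erratum microcode penalizes by shutting those code chunks out of the uop cache.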

  • You can't make the CPU itself work better, all you (or the compiler) can do is avoid the performance potholes. Depending on what you mean by "really healing the problem", I think you could describe it this way. It should reliably make your code keep running fast, even with other changes to the surrounding code. If that was the real cause of the slowdown, then it does avoid it on purpose, not by accident, so that slowdown-reason won't happen again. – Peter Cordes Aug 23 '21 at 15:16