Intel 3770K assembly code - align 16 has unexpected effects

Question

I first posted about this issue in this question:

Indexed branch overhead on X86 64 bit mode

I've since noticed this in a few other assembly programs, where align 16 has little effect or, in some cases, makes things worse. In my prior question, I also compared aligning to even versus odd multiples of 16, which made a significant difference for small, tight loops.

The most recent example is a program that calculates pi to about 1 million digits using a 4-term arctan series (a Machin-type formula) combined with multi-threading. It is a mini-version of the approach used at Tokyo University in 2002 to calculate over 1 trillion digits:

http://www.super-computing.org/pi_current.html.en

The aligns had almost no effect on the compute time, but removing them decreased the time for the conversion from fractional to decimal from 7.5 seconds to 6.8 seconds, a bit over a 9% decrease. Rearranging the compute code in some cases increased the time from 98 seconds to 109 seconds, about an 11% increase. However, the worst case was my prior question, where there was a 36.5% increase in time for a tight loop depending on where the loop was located.

I'm wondering if this is specific to the Intel 3770K 3.5 GHz processor I'm running these tests on.
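
To make the even/odd multiple-of-16 distinction concrete: an odd multiple of 16 is 16-byte aligned but not 32-byte aligned, so a short loop placed there can straddle a 32-byte boundary that the same loop at an even multiple would not. The sketch below only counts how many 32-byte windows a block of code touches. The function name `windows32`, the addresses, and the 30-byte loop length are hypothetical, and whether this effect explains the timings above is not established here.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the 32-byte-aligned windows touched by a code block that
 * starts at address `start` and is `len` bytes long (len must be > 0).
 * Illustrative helper only; it does not model any specific CPU. */
unsigned windows32(uintptr_t start, size_t len)
{
    uintptr_t first = start / 32;              /* window holding the first byte */
    uintptr_t last  = (start + len - 1) / 32;  /* window holding the last byte  */
    return (unsigned)(last - first + 1);
}
```

With a hypothetical 30-byte loop, `windows32(0x1000, 30)` gives 1 (even multiple of 16), while `windows32(0x1010, 30)` gives 2 (odd multiple of 16), even though both start addresses satisfy align 16.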

If a loop doesn't fit in the loop buffer (28 uops, or 56 on IvB when running a single thread, IIRC) the way it packs into uop cache lines can matter. So 32-byte boundaries matter. You can maybe use perf counters to look for hotspots where fewer than 4 uops were sent from the uop cache to the IDQ in a cycle. I only have a Skylake so I can't really investigate (it fetches up to 6 uops per clock from the uop cache to the IDQ, not 4.) — Peter Cordes, Jun 21 '18 at 13:32
BTW, if the conversion involved micro-coded instructions like `div`, it's quite possible that different alignment broke the uop cache, leading to switching between uop cache and legacy decoders. [Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs](https://stackoverflow.com/q/26907523) — Peter Cordes, Jun 21 '18 at 14:05
@PeterCordes - I'm using multiply / shift to implement divide by constant. Some of these aligns were being used to align functions, and most of the loops are not that small (unlike my prior question). I don't think Visual Studio's ML64.EXE allows align 32, unless there's a way to align a segment on a 32-byte boundary. I'll have to experiment with this. — rcgldr, Jun 21 '18 at 21:42
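
For readers unfamiliar with the multiply / shift technique mentioned in the last comment, here is a minimal sketch in C. It assumes a divisor of 10 and 32-bit unsigned operands purely for illustration; the actual divisor and operand width used in the pi program are not stated in the thread, and the names `div10` and `mod10` are hypothetical.

```c
#include <stdint.h>

/* Unsigned n / 10 without a div instruction.
 * 0xCCCCCCCD is ceil(2^35 / 10); multiplying by it in 64 bits and
 * shifting right by 35 yields the exact quotient for every 32-bit n. */
uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Remainder recovered from the quotient: n - (n / 10) * 10. */
uint32_t mod10(uint32_t n)
{
    return n - div10(n) * 10u;
}
```

Because this replaces `div` with a multiply and a shift, the micro-coded `div` case raised in the earlier comment would not apply to code written this way.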
