I first posted about this issue in this question:
Indexed branch overhead on X86 64 bit mode
I've since noticed this in a few other assembly code programs, where align 16 has little effect, or in some cases makes the situation worse. In my prior question, I was also comparing aligning to even or odd multiples of 16 with significant difference in the case of small, tight loops.
The most recent example I encountered this issue with is a program to calculate pi to about 1 million digits, using a 4 term arctan series (Machin type forumla), combined with multi-threading, a mini-version of the approached used at Tokyo University in 2002 to calculate over 1 trillion digits
http://www.super-computing.org/pi_current.html.en
The aligns had almost no effect on the compute time, but removing them decreased the conversion from fractional to decimal from 7.5 seconds to 6.8 seconds, a bit over a 9% decrease. Rearranging the compute code in some cases increased the time from 98 seconds to 109 seconds, about 11% increase. However the worst case was my prior question, where there was a 36.5% increase in time for a tight loop depending on where the loop was located.
I'm wondering if this is specific to the Intel 3770K 3.5 ghz processor I'm running these tests on.