
The Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, section 8.1.1) says "nonaligned data accesses will seriously impact the performance of the processor". So I wrote a test to demonstrate it, but the result is that aligned and nonaligned data accesses have the same performance. Why? Could someone help? My code is shown below:

#include <iostream>
#include <stdint.h>
#include <time.h>
#include <chrono>
#include <string.h>
#include <stdlib.h>
using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}
int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    cout << "offset = " << offset << endl;
    const uint64_t BUFFER_SIZE = 800000000;
    uint8_t* data_ptr = new uint8_t[BUFFER_SIZE];
    if (data_ptr == nullptr) {
        cout << "apply for memory failed" << endl;
        return 0;
    }
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 300;
    cout << "start" << endl;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        for (uint64_t j = offset; j <= BUFFER_SIZE - 8; j+= 8) { // align:offset = 0 nonalign: offset=1-7
            volatile auto tmp = *(uint64_t*)&data_ptr[j]; // read from memory
            //mov rax,QWORD PTR [rbx+rdx*1] // rbx+rdx*1 = 0x7fffc76fe019 
            //mov QWORD PTR [rsp+0x8],rax 
            ++tmp;
            //mov rcx,QWORD PTR [rsp+0x8] 
            //add rcx,0x1 
            //mov QWORD PTR [rsp+0x8],rcx
            *(uint64_t*)&data_ptr[j] = tmp; // write to memory
            //mov QWORD PTR [rbx+rdx*1],rcx
        }
    }
    auto end = get_time_ns();
    cout << "time elapse " << end - start << "ns" << endl;
    return 0;
}

RESULT:

offset = 0
start
time elapse 32991486013ns
offset = 1
start
time elapse 34089866539ns
offset = 2
start
time elapse 34011790606ns
offset = 3
start
time elapse 34097518021ns
offset = 4
start
time elapse 34166815472ns
offset = 5
start
time elapse 34081477780ns
offset = 6
start
time elapse 34158804869ns
offset = 7
start
time elapse 34163037004ns
Hankin
  • Are you sure that the C++ compiler isn't aligning the data optimally for you? To really evaluate alignment performance, you'll need to check what the C++ compiler is generating and you may need to write your perf test in assembly language. – lurker Dec 20 '21 at 16:13
  • Aligned data will never cross a cache line boundary or, worse, a page boundary, but unaligned data may. It is in that situation that you will see a significant performance effect. – prl Dec 20 '21 at 16:50
  • Why do you want a `volatile uint64_t tmp` local variable increment inside your inner loop? What's the point of introducing an aligned store/reload there by making the load result volatile, not the access to the array? Anyway, have a look at [How can I accurately benchmark unaligned access speed on x86\_64?](https://stackoverflow.com/q/45128763) for some details on the effects you want to look for. – Peter Cordes Dec 20 '21 at 19:28
  • @lurker Yeah, I am sure. I have checked the assembly code; please see my code, I have put the assembly in it as comments. – Hankin Dec 21 '21 at 01:21
  • Did you ever test `offset = 46` or `62` to actually get a cache-line split? Assuming `new` gave you memory that's either aligned to the start of a cache line, or has `ptr % 64 = 16` like glibc likes to do, keeping the start of a larger allocation for its bookkeeping metadata... If you're on any Intel since Haswell, or AMD since Zen IIRC, there are zero penalties for misalignment within a cache line. (And on earlier Intel, not much penalty within a cache line, just possible SnB bank conflicts if you had memory parallelism.) – Peter Cordes Dec 21 '21 at 02:12
  • Keep in mind that some parts of Intel's optimization manual were written in Pentium 4 (netburst) days, and aren't as applicable to Sandybridge-family. Consult Agner Fog's microarch guide at https://agner.org/optimize/, although he doesn't focus a huge amount on misaligned memory costs in throughput vs. latency. – Peter Cordes Dec 21 '21 at 02:14

1 Answer


On most modern x86 cores, aligned and misaligned accesses perform the same, as long as the access does not cross a certain internal boundary.

The exact size of the internal boundary varies based on the core architecture of the relevant CPU, but on Intel CPUs from the last decade, the relevant boundary is the 64-byte cache line. That is, accesses which fall entirely within a 64-byte cache line perform the same regardless of whether they are aligned or not.

If a (necessarily misaligned) access does cross a cache line boundary on an Intel chip, however, it pays a penalty of about 2x in both latency and throughput. The bottom-line impact of this penalty depends on the surrounding code and will often be much less than 2x, sometimes close to zero. The penalty can be much larger if a 4K page boundary is also crossed.

Aligned accesses never cross these boundaries, so cannot suffer this penalty.
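As a minimal sketch (assuming a 64-byte line, as on recent Intel cores), a check like the hypothetical helper below tells you whether a given access would straddle a line boundary; it is illustrative only, not part of the benchmark further down:

#include <stdint.h>
#include <stddef.h>

// Sketch: true if an access of `size` bytes (size > 0) starting at `addr`
// touches two different 64-byte cache lines (line size assumed).
static bool crosses_cache_line(const void* addr, size_t size)
{
    const uintptr_t LINE = 64;
    uintptr_t first = (uintptr_t)addr;
    uintptr_t last  = first + size - 1;        // last byte touched
    return (first / LINE) != (last / LINE);    // different lines => split access
}

An 8-byte access at an 8-byte-aligned address can never return true here, which is the sense in which aligned accesses are immune to the penalty.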

The broad picture is similar for AMD chips, though the relevant boundary has been smaller than 64 bytes on some recent chips, and the boundary is different for loads and stores.

I have included additional details in the load throughput and store throughput sections of a blog post I wrote.

Testing It

Your test wasn't able to show the effect for several reasons:

  • The test didn't allocate aligned memory; you can't reliably cross a cache line by applying an offset to a region of unknown alignment.
  • You iterated 8 bytes at a time, so the majority of the writes (7 out of 8) fall entirely within a cache line and pay no penalty, leaving only a small signal which will be detectable only if the rest of your benchmark is very clean.
  • You used a large buffer which doesn't fit in any level of the cache. The split-line effect is only really obvious at L1, or when splitting lines means you bring in twice the number of lines (e.g., random access). Since you access every line linearly in either scenario, you'll be limited by DRAM-to-core throughput regardless of splits: the split writes have plenty of time to complete while waiting for main memory.
  • You used a local volatile auto tmp and ++tmp, which creates a volatile variable on the stack and a lot of loads and stores to preserve volatile semantics: these are all aligned and wash out the effect you are trying to measure.

Here is my modification of your test, operating only in the L1 region, and which advances 64 bytes at a time, so every store will be a split if any is:

#include <iostream>
#include <stdint.h>
#include <time.h>
#include <chrono>
#include <string.h>
#include <stdlib.h>
#include <iomanip>

using namespace std;

static inline int64_t get_time_ns()
{
    std::chrono::nanoseconds a = std::chrono::high_resolution_clock::now().time_since_epoch();
    return a.count();
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        cout << "Usage:./test [01234567]" << endl;
        cout << "0 - aligned, 1-7 - nonaligned offset" << endl;
        return 0;
    }
    uint64_t offset = atoi(argv[1]);
    const uint64_t BUFFER_SIZE = 10000;
    alignas(64) uint8_t data_ptr[BUFFER_SIZE];
    memset(data_ptr, 0, sizeof(uint8_t) * BUFFER_SIZE);
    const uint64_t LOOP_CNT = 1000000;
    auto start = get_time_ns();
    for (uint64_t i = 0; i < LOOP_CNT; ++i) {
        uint64_t src = rand();
        for (uint64_t j = offset; j + 64 <= BUFFER_SIZE; j += 64) { // step by 64 so every write lands at the same offset within its cache line
            memcpy(data_ptr + j, &src, 8);
        }
    }
    auto end = get_time_ns();
    cout << "time elapsed " << std::setprecision(2) << (end - start) / ((double)LOOP_CNT * BUFFER_SIZE / 64) <<
        "ns per write (rand:" << (int)data_ptr[rand() % BUFFER_SIZE] << ")" << endl;
    return 0;
}

Running this for all offsets from 0 to 64, I get:

$ g++ test.cpp -O2 && for off in {0..64}; do printf "%2d :" $off && ./a.out $off; done
 0 :time elapsed 0.56ns per write (rand:0)
 1 :time elapsed 0.57ns per write (rand:0)
 2 :time elapsed 0.57ns per write (rand:0)
 3 :time elapsed 0.56ns per write (rand:0)
 4 :time elapsed 0.56ns per write (rand:0)
 5 :time elapsed 0.56ns per write (rand:0)
 6 :time elapsed 0.57ns per write (rand:0)
 7 :time elapsed 0.56ns per write (rand:0)
 8 :time elapsed 0.57ns per write (rand:0)
 9 :time elapsed 0.57ns per write (rand:0)
10 :time elapsed 0.57ns per write (rand:0)
11 :time elapsed 0.56ns per write (rand:0)
12 :time elapsed 0.56ns per write (rand:0)
13 :time elapsed 0.56ns per write (rand:0)
14 :time elapsed 0.56ns per write (rand:0)
15 :time elapsed 0.57ns per write (rand:0)
16 :time elapsed 0.56ns per write (rand:0)
17 :time elapsed 0.56ns per write (rand:0)
18 :time elapsed 0.56ns per write (rand:0)
19 :time elapsed 0.56ns per write (rand:0)
20 :time elapsed 0.56ns per write (rand:0)
21 :time elapsed 0.56ns per write (rand:0)
22 :time elapsed 0.56ns per write (rand:0)
23 :time elapsed 0.56ns per write (rand:0)
24 :time elapsed 0.56ns per write (rand:0)
25 :time elapsed 0.56ns per write (rand:0)
26 :time elapsed 0.56ns per write (rand:0)
27 :time elapsed 0.56ns per write (rand:0)
28 :time elapsed 0.57ns per write (rand:0)
29 :time elapsed 0.56ns per write (rand:0)
30 :time elapsed 0.57ns per write (rand:25)
31 :time elapsed 0.56ns per write (rand:151)
32 :time elapsed 0.56ns per write (rand:123)
33 :time elapsed 0.56ns per write (rand:29)
34 :time elapsed 0.55ns per write (rand:0)
35 :time elapsed 0.56ns per write (rand:0)
36 :time elapsed 0.57ns per write (rand:0)
37 :time elapsed 0.56ns per write (rand:0)
38 :time elapsed 0.56ns per write (rand:0)
39 :time elapsed 0.56ns per write (rand:0)
40 :time elapsed 0.56ns per write (rand:0)
41 :time elapsed 0.56ns per write (rand:0)
42 :time elapsed 0.57ns per write (rand:0)
43 :time elapsed 0.56ns per write (rand:0)
44 :time elapsed 0.56ns per write (rand:0)
45 :time elapsed 0.56ns per write (rand:0)
46 :time elapsed 0.57ns per write (rand:0)
47 :time elapsed 0.57ns per write (rand:0)
48 :time elapsed 0.56ns per write (rand:0)
49 :time elapsed 0.56ns per write (rand:0)
50 :time elapsed 0.57ns per write (rand:0)
51 :time elapsed 0.56ns per write (rand:0)
52 :time elapsed 0.56ns per write (rand:0)
53 :time elapsed 0.56ns per write (rand:0)
54 :time elapsed 0.55ns per write (rand:0)
55 :time elapsed 0.56ns per write (rand:0)
56 :time elapsed 0.56ns per write (rand:0)
57 :time elapsed 1.1ns per write (rand:0)
58 :time elapsed 1.1ns per write (rand:0)
59 :time elapsed 1.1ns per write (rand:0)
60 :time elapsed 1.1ns per write (rand:0)
61 :time elapsed 1.1ns per write (rand:0)
62 :time elapsed 1.1ns per write (rand:0)
63 :time elapsed 1ns per write (rand:0)
64 :time elapsed 0.56ns per write (rand:0)

Note that offsets 57 through 63 all take about 2x as long per write, and those are exactly the offsets that cross a 64-byte (cache line) boundary for an 8-byte write.
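To spell out the arithmetic: with the buffer aligned to 64 bytes, an 8-byte write at offset j touches bytes j through j+7 of a line, so it splits exactly when (j mod 64) + 8 > 64, i.e. when j mod 64 is 57 through 63. A tiny sketch, assuming the same 64-byte line and a line-aligned buffer, that prints those offsets:

#include <iostream>

int main()
{
    // For an 8-byte write at offset j in a 64-byte-aligned buffer, bytes
    // j..j+7 are touched; the write splits a line exactly when the last
    // byte spills into the next 64-byte line.
    for (int j = 0; j < 64; ++j) {
        if (j + 8 > 64)                        // last touched byte j+7 is in the next line
            std::cout << j << " : split\n";    // prints 57 through 63
    }
    return 0;
}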

BeeOnRope
  • Thank you for your response. I agree with you, and I can find what you said in the Intel manual. But I'd like to demonstrate the penalty of crossing the boundary in C++ code, and so far I still can't. – Hankin Dec 24 '21 at 01:29
  • @Hankin - I fixed up your program and added some output which shows the difference between split and non-split stores. You could find a similar effect for loads. – BeeOnRope Dec 24 '21 at 02:22
  • Thanks, buddy. Your code proves it exactly. I have tested on 3 kinds of CPU architectures. For Intel x86-64, it goes from 0.5ns (normal) to 0.75ns (crossing a cache line). For ARMv8, from 0.6ns to 0.85ns. For Hygon x86-64, from 0.87ns to 1.2ns. But there is a small difference: on Intel x86-64 only offsets 57-63 have a penalty, while on the other 2 CPUs offsets 9-15, 25-31, 41-47, and 57-63 have a penalty. It seems like they rely on 16-byte alignment or something, I guess. Their cache lines are all 64 bytes. Do you have any idea about this? – Hankin Dec 27 '21 at 03:21
  • @Hankin - yes, this is why I referred to an "internal boundary" above, rather than just saying cache line. This effect occurs because internally the cache uses aligned banks/sectors, or something similar, of a certain size: accesses within one bank/sector are handled efficiently on the fast path, but anything that crosses one needs two reads/writes from two different components plus a step to combine the parts, generally taking twice as long. The size of this internal structure is an implementation detail and will vary across architectures. – BeeOnRope Dec 29 '21 at 02:51