
I have implemented a simple linear-probing hash map with an array-of-structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, because key and value are 64-bit integers while the flag only needs a single byte, so the compiler pads the struct up to 24 bytes. Hence, I have also tried packing the struct (17 bytes) at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
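
For reference, a minimal sketch of the two layouts (struct names shortened from the full benchmark below; the 24-byte and 17-byte sizes are what GCC produces on x86-64):

#include <cstdint>
#include <iostream>

struct Padded {    // default layout chosen by the compiler
  uint64_t key;    // 8 bytes
  uint64_t value;  // 8 bytes
  bool is_valid;   // 1 byte, followed by 7 bytes of trailing padding (alignof == 8)
};

struct Packed {    // no padding, but entries in an array become misaligned
  uint64_t key;
  uint64_t value;
  uint8_t is_valid;
} __attribute__((__packed__));

int main() {
  std::cout << "sizeof(Padded) = " << sizeof(Padded)    // 24 on x86-64 GCC
            << ", sizeof(Packed) = " << sizeof(Packed)  // 17
            << std::endl;
}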

When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors 25% and 50%, with the successful query rate (0, 25, 50, 75, 100) on the x axis:

[graph: lookup throughput vs. successful query rate, padded vs. packed structs, on Intel and AMD]

As you can see, on Intel the grey and blue (circle and square) lines almost overlap, so the benefit of struct packing is marginal. On AMD, however, the line representing unaligned/packed structs is consistently higher, i.e., we get more throughput.

In order to investigate this, I tried to build a smaller microbenchmark. It performs a similar access pattern, but without the hash map find logic (i.e., it just picks random indices in the array and advances a few entries from there). Please find the benchmark here:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }

template <typename T>
void doNotOptimize(T const& val) {
  asm volatile("" : : "r,m"(val) : "memory");
}

struct PaddedStruct {
  uint64_t key;
  uint64_t value;
  bool is_valid;

  PaddedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
};

struct PackedStruct {
  uint64_t key;
  uint64_t value;
  uint8_t is_valid;

  PackedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
} __attribute__((__packed__));

int main() {
  const uint64_t size = 134217728;
  uint16_t repetitions = 0;
  uint16_t advancement = 0;

  std::cin >> repetitions;
  std::cout << "Got " << repetitions << std::endl;
  std::cin >> advancement;
  std::cout << "Got " << advancement << std::endl;
  std::cout << "Initializing." << std::endl;

  std::vector<PaddedStruct> padded(size);
  std::vector<PackedStruct> unaligned(size);
  std::vector<uint64_t> queries(size);

  // Initialize the structs with random values + prefault
  std::random_device rd;
  std::mt19937 gen{rd()};
  std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
  std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};

  for (uint64_t i = 0; i < padded.size(); ++i) {
    padded[i].key = dist(gen);
    padded[i].value = dist(gen);
    padded[i].is_valid = 1;
  }

  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    unaligned[i].key = padded[i].key;
    unaligned[i].value = padded[i].value;
    unaligned[i].is_valid = 1;
  }

  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    queries[i] = dist2(gen);
  }

  std::cout << "Running benchmark." << std::endl;

  ClobberMemory();
  auto start_padded = std::chrono::high_resolution_clock::now();
  PaddedStruct* padded_ptr = nullptr;
  uint64_t sum = 0;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        padded_ptr = &padded[query + i];
        if (padded_ptr->is_valid) [[likely]] {
          sum += padded_ptr->value;
        }
      }
      doNotOptimize(sum);
    }
  }

  ClobberMemory();
  auto end_padded = std::chrono::high_resolution_clock::now();
  uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
  std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl;  // print sum to avoid that it gets optimized out

  ClobberMemory();
  auto start_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t sum2 = 0;
  PackedStruct* packed_ptr = nullptr;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        packed_ptr = &unaligned[query + i];
        if (packed_ptr->is_valid) [[likely]] {
          sum2 += packed_ptr->value;
        }
      }
      doNotOptimize(sum2);
    }
  }
  ClobberMemory();
  auto end_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
  std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}

When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you enter 3 (and press Enter) and then 5 (and press Enter again). I updated the source code to (a) avoid loop unrolling by the compiler, which happened while repetitions/advancement were hardcoded, and (b) use pointers into the vectors, as that more closely resembles what the hash map is doing.

On the Intel CPU, I get:

Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185

On the AMD CPU, I get:

Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926

So while in this microbenchmark Intel still benefits a little from the unaligned access, for the AMD CPU both the absolute and the relative improvement are higher. I cannot explain this. In general, from what I've learned from relevant SO threads, an unaligned access to a single member is just as expensive as an aligned access, as long as it stays within a single cache line (1). Also, (1) references (2), which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds' post, I could not find any other documentation of cache fetch widths in processors, and especially not for my two concrete CPUs, to figure out whether that might have something to do with this.

Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.

Thank you so much.

(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?

(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779

Maxbit

2 Answers


The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.

The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as a necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.

Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.

I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
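
As a minimal sketch of that kind of measurement (assuming an x86 target and, as suggested above, the core clock pinned to the nominal/TSC frequency so that TSC ticks approximate core cycles; the helper name is illustrative):

#include <cstdint>
#include <iostream>
#include <x86intrin.h>  // __rdtsc

// Run `body` once and report TSC ticks per update; `num_updates` is the
// number of logical updates the body performs.
template <typename Fn>
double TicksPerUpdate(Fn&& body, uint64_t num_updates) {
  const uint64_t start = __rdtsc();
  body();
  const uint64_t end = __rdtsc();
  return static_cast<double>(end - start) / static_cast<double>(num_updates);
}

int main() {
  volatile uint64_t sink = 0;
  const uint64_t n = 100000000;  // stand-in workload: n trivial "updates"
  const double ticks = TicksPerUpdate([&] {
    for (uint64_t i = 0; i < n; ++i) sink += i;
  }, n);
  std::cout << ticks << " TSC ticks per update" << std::endl;
}

For the benchmark in the question, num_updates would be repetitions * queries.size() * advancement; instruction counts would still have to come from the PMU (e.g. via perf).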

Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.

John D McCalpin
  • My testing on Skylake-client agrees with that: half throughput for cache-line-split loads that don't split across a 4k boundary. (And replays of uops dependent on them, if they dispatched eagerly in anticipation of the fast case.) (NASM source, `perf` results, and writeup on [How can I accurately benchmark unaligned access speed on x86_64?](https://stackoverflow.com/a/45129784)). I think BeeOnRope and I mostly tested scalar and maybe XMM, but this sounds familiar for YMM, too. 4k splits are lower throughput, like 1 per ~3.8c, but not a total disaster on SKL and later, unlike Haswell. – Peter Cordes Jun 05 '22 at 03:16
  • Thank you very much for your answer. However, (1) the struct padded automatically by the compiler is not naturally aligned: it is 24 bytes large and can therefore cross the 64 B cache line boundary on both the Intel and the AMD system. The core question I have is: why do Intel and AMD behave differently here? I still do not understand the reason for this. On Intel, there is no performance difference between unaligned access and aligned access, while on AMD, there is a speedup for unaligned access. I still do not see how I can explain this behavior... Any further help is very much appreciated. – Maxbit Jun 05 '22 at 14:29
  • (I did some further digging with `perf` and found out that on both systems, the unaligned version takes more instructions to access the struct members, but fewer memory loads. However, as the AMD system actually has higher single-threaded memory read bandwidth, I still do not know how to explain this.) – Maxbit Jun 05 '22 at 14:30
  • And for both systems, the unaligned version takes fewer cycles according to perf than the aligned version, despite executing more instructions. According to rusage, more "system time" is spent for the unaligned version but the same amount of user time on Intel. On AMD, it is also slightly more system time, but not by as much. – Maxbit Jun 05 '22 at 15:07

The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structures.

Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/
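
To illustrate that claim with the entry sizes from the question, here is a small sketch (assuming 64-byte cache lines and a densely packed array that starts on a line boundary) that counts how many entries straddle a line and how many lines each entry touches on average:

#include <cstdint>
#include <iostream>

int main() {
  const uint64_t kLineSize = 64;            // assumed cache-line size
  const uint64_t entry_sizes[] = {24, 17};  // padded vs. packed entry from the question
  const uint64_t n = 1 << 20;               // number of consecutive entries examined

  for (uint64_t entry_size : entry_sizes) {
    uint64_t straddling = 0;     // entries that span two cache lines
    uint64_t lines_touched = 0;  // total lines read if each entry is accessed once
    for (uint64_t i = 0; i < n; ++i) {
      const uint64_t first_line = (i * entry_size) / kLineSize;
      const uint64_t last_line = (i * entry_size + entry_size - 1) / kLineSize;
      lines_touched += last_line - first_line + 1;
      straddling += (last_line != first_line);
    }
    std::cout << "entry size " << entry_size << ": "
              << 100.0 * straddling / n << "% straddle a line, "
              << static_cast<double>(lines_touched) / n << " lines per entry"
              << std::endl;
  }
}

Under these assumptions both layouts come out the same: 25% of the entries straddle a cache line, and each entry touches 1.25 lines on average.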

Daniel Lemire