Aligning to cache line and knowing the cache line size

Question

To prevent false sharing, I want to align each element of an array to a cache line. First, I need to know the cache line size so that I can give each element that many bytes. Second, I want the start of the array to be aligned to a cache line.

I am using Linux on an 8-core x86 platform. First, how do I find the cache line size? Second, how do I align to a cache line in C? I am using the gcc compiler.

For example, assuming a cache line size of 64, the layout would be as follows.

element[0] occupies bytes 0-63
element[1] occupies bytes 64-127
element[2] occupies bytes 128-191

and so on, assuming of course that bytes 0-63 are aligned to a cache line.

Perhaps this can help: http://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size — Tony The Lion, Sep 02 '11 at 09:46
Possible duplicate of [Programmatically get the cache line size?](http://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size) — Ciro Santilli OurBigBook.com, Mar 16 '17 at 07:56
It's not a bad idea to use a compile-time constant of 64 bytes as the cache-line size, so the compiler can bake that into functions that care about it. Making the compiler generate code for a runtime-variable cache line size could eat up some of the benefit of aligning things, especially in cases of auto-vectorization where it helps the compiler make better code if it knows a pointer is aligned to a cache line width (which is wider than the SIMD vector width). — Peter Cordes, Mar 12 '18 at 04:32

Maxim Egorushkin · Answer 1 · 2017-01-18T10:20:57.657

93

I am using Linux on an 8-core x86 platform. First, how do I find the cache line size?

$ getconf LEVEL1_DCACHE_LINESIZE
64

Pass the value as a macro definition to the compiler.

$ gcc -DLEVEL1_DCACHE_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` ...
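
On the C side, the macro passed with -D can then be consumed as an ordinary compile-time constant. The following is only a sketch: the fallback value of 64 and the name line_buffer are assumptions added for illustration, and the guard exists because getconf may print 0 or nothing on systems that do not report the value.

```c
#include <assert.h>
#include <stdalign.h>

/* Normally supplied by the build with -DLEVEL1_DCACHE_LINESIZE=...
   Fall back to 64 (an assumption) if the flag is missing, empty, or 0. */
#if !defined(LEVEL1_DCACHE_LINESIZE) || (LEVEL1_DCACHE_LINESIZE + 0) == 0
#undef LEVEL1_DCACHE_LINESIZE
#define LEVEL1_DCACHE_LINESIZE 64
#endif

/* Alignment values must be powers of two. */
static_assert((LEVEL1_DCACHE_LINESIZE & (LEVEL1_DCACHE_LINESIZE - 1)) == 0,
              "cache line size must be a power of two");

/* Hypothetical buffer: four cache lines, starting on a line boundary. */
static alignas(LEVEL1_DCACHE_LINESIZE)
    unsigned char line_buffer[4 * LEVEL1_DCACHE_LINESIZE];
```

Because the value is fixed at build time, the compiler can use it in array sizes, static assertions, and alignment specifiers, none of which accept a run-time value.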

At run time, sysconf(_SC_LEVEL1_DCACHE_LINESIZE) can be used to get the L1 data cache line size.
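
A minimal sketch of the run-time query follows. The helper name cache_line_size and the fallback of 64 are assumptions for illustration; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, and sysconf can return 0 or -1 when the value is not known, so the result is checked before use.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Assumed fallback when the system cannot report a line size. */
#define FALLBACK_CACHE_LINE 64

/* Hypothetical helper: L1 data cache line size in bytes. */
static size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)                  /* 0 or -1 means "not available" */
        return (size_t)n;
#endif
    return FALLBACK_CACHE_LINE;
}
```

A caller would typically query this once at startup and cache the result, since the value does not change while the program runs.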

edited Jan 18 '17 at 10:20

answered Sep 02 '11 at 14:24

Maxim Egorushkin


Where are these `sysconf()`s specified? POSIX / IEEE Std 1003.1-20xx ? – Brian Cain Jun 16 '17 at 21:20
@BrianCain http://pubs.opengroup.org/onlinepubs/9699919799/functions/sysconf.html – Maxim Egorushkin Jun 16 '17 at 21:59
@BrianCain I use Linux, so I just did `man sysconf`. Linux is not exactly POSIX compliant, so the Linux-specific documentation is often more useful. Sometimes it is out of date, so you just `egrep -nH -r /usr/include -e '\b_SC'`. – Maxim Egorushkin Jun 16 '17 at 22:01
In case of Mac, use `sysctl hw.cachelinesize`. – Dení Jan 21 '18 at 00:12
Usually it's so much better to have a compile-time-constant line size that I'd rather hard-code 64 than call `sysconf`. The compiler won't even know it's a power of 2, so you'll have to manually do stuff like `offset = ptr & (linesize-1)` for remainder or bit-scan + right-shift to implement division. You can't just use `/` in code that's performance-sensitive. – Peter Cordes Nov 29 '19 at 13:29
But if you used a cross compiler, that wouldn't work, right? Because it would get the cache line size of your current architecture and not that of your target architecture. – ilstam May 12 '20 at 11:14
@ilstam When cross-compiling you would need to obtain that `getconf LEVEL1_DCACHE_LINESIZE` from your target architecture, sure. Your build system might provide it, or you'd have to hardcode it as a system-specific value into your build system. – Maxim Egorushkin Jun 11 '20 at 22:34
@ilstam Another method is to have arch-specific implementations in different shared libraries and load the right one at run-time. Or, more advanced users, could have their own mechanisms of using arch-specific functions, but one would need to be an expert with all the details involved (which isn't rocket science, but requires a bit of thorough reading and appreciation). – Maxim Egorushkin Jun 11 '20 at 22:43
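
Peter Cordes's comment above about preferring a compile-time constant can be sketched as follows. The constant 64 and the helper names line_offset and line_index are assumptions for illustration. With a constant power-of-two line size, the remainder is a single mask and the division typically compiles to a shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* compile-time constant; assumed for x86-64 */

static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0,
              "the mask trick below needs a power of two");

/* Byte offset of p within its cache line. */
static inline size_t line_offset(const void *p)
{
    return (size_t)((uintptr_t)p & (CACHE_LINE - 1));
}

/* Index of the cache line containing p, counted from address 0. */
static inline uintptr_t line_index(const void *p)
{
    return (uintptr_t)p / CACHE_LINE;   /* constant divisor */
}
```

With a line size read from sysconf at run time, the same code would need the mask and shift written out by hand, as the comment points out.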

score 42 · Accepted Answer · edited Dec 18 '16 at 17:43

To know the sizes, you need to look them up in the documentation for the processor; afaik there is no programmatic way to do it. On the plus side, however, most cache lines are of a standard size, based on Intel's standards. On x86, cache lines are 64 bytes. However, to prevent false sharing, you need to follow the guidelines of the processor you are targeting (Intel has some special notes on its NetBurst-based processors). Generally you need to align to 64 bytes for this (Intel states that you should also avoid crossing 16-byte boundaries).

To do this in C or C++, use the standard aligned_alloc function or one of the compiler-specific specifiers such as __attribute__((aligned(64))) or __declspec(align(64)). To pad between members in a struct so that they land on different cache lines, you need to insert a member big enough to push the next one to the next 64-byte boundary.
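
As a sketch of the GCC specifier applied to the layout from the question (the names struct element and elements are hypothetical, and 64 is the assumed line size): putting the aligned attribute on the struct type raises its alignment and, through tail padding, rounds its size up to a multiple of 64, so consecutive array elements land on consecutive cache lines.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size */

/* GCC/Clang syntax: every object of this type starts on a 64-byte
   boundary and sizeof is rounded up to a multiple of 64. */
struct element {
    long counter;
} __attribute__((aligned(CACHE_LINE)));

static_assert(sizeof(struct element) == CACHE_LINE,
              "exactly one element per cache line");

/* The array start inherits the element type's alignment. */
static struct element elements[8];
```

Here elements[0] occupies bytes 0-63 relative to an aligned base, elements[1] occupies bytes 64-127, and so on.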

edited Dec 18 '16 at 17:43

Bradley Garagan


answered Sep 02 '11 at 09:50

Necrolis


But how do I align to a cache line in c? – MetallicPriest Sep 02 '11 at 09:52
@MetallicPriest: updated my post a bit (note: there was an error in cache line size, align to 64 bytes, not 16, 16 bytes is to prevent splitting) – Necrolis Sep 02 '11 at 10:05
@MetallicPriest: gcc _and_ g++ both support `__attribute__` – Sebastian Mach Sep 02 '11 at 10:06
Is memory mapped by mmap, aligned too? – MetallicPriest Sep 02 '11 at 10:33
@MetallicPriest: `mmap` & `VirtualAlloc` allocate page-aligned memory. Generally page granularity is 64 KB (under Windows), and since 64 KB is a multiple of 64, it will be aligned properly. – Necrolis Sep 02 '11 at 10:45
You can get the cache line size programmatically. Check [here](http://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size). Also, you cannot generalize to 64-byte cache lines on x86. That is only true for recent ones. – tothphu Jun 20 '12 at 22:11
@tothphu: a more portable way to get it is via `CPUID`, and as of many revisions of the Intel guides, cache lines have been 64 bytes, IIRC even the P4 (which is now *ancient*) had 64 byte cachelines (in fact, it did, see: http://www.osronline.com/article.cfm?article=273). also there is no need to spam the link, rather just edit your comment. – Necrolis Jun 21 '12 at 07:16
@Necrolis I seem to remember reading 32 bytes somewhere in the Core Duo timeframe, but my memory is probably deceiving me. Also, I couldn't edit the comment because I had passed the 5-minute limit. – tothphu Jun 22 '12 at 07:52
C++11 adds `alignas`, which is a portable way of specifying alignment – NoSenseEtAl Oct 19 '18 at 02:43
@NoSenseEtAl `alignas` officially only supports alignment up till the size of the type `std::max_align_t`, which is typically the alignment requirement of a `long double`, aka 8 or 16 bytes - not 64 unfortunately. See for example https://stackoverflow.com/questions/49373287/gcc-over-aligned-new-support-alignas – Carlo Wood Jul 20 '19 at 15:40
@CarloWood: Compilers are *allowed* to support over-aligned types, and in practice they do. (all of gcc, clang, MSVC, ICC support `alignas(64)`). True that ISO C++ only *requires* `alignas` up to `alignof(max_align_t)`, but it also doesn't specify `__declspec` or `__attribute__`. I'd call `alignas` portable because in real life compilers can and do support it because it's useful. Not in the same sense that behaviour required by ISO C++ is portable, sure. – Peter Cordes Nov 29 '19 at 13:13
@Necrolis: re: earlier comments: x86 (and x86-64) page size is 4kiB. x86-64 hugepages are 2MiB or 1GiB. Yes, everything uses 64-byte cache lines since Core 2 at least, so all x86-64. Pentium II/III did use 32-byte lines, maybe even Pentium M / Core solo/duo. Over-aligning might waste a bit of space on those ancient CPUs, but it's not a big deal. On modern CPUs, L2 spatial prefetch tries to complete an aligned pair of cache lines (128 bytes) so it can sometimes make sense to align by 128. – Peter Cordes Nov 29 '19 at 13:17

score 14 · Answer 3 · edited Dec 27 '20 at 14:23

Another simple way is to look in /proc/cpuinfo:

grep cache_alignment /proc/cpuinfo
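
The same lookup can be done from C by scanning the file. This is a rough sketch with a hypothetical helper name; it assumes the x86 Linux field format "cache_alignment : 64" and returns -1 when the file or the field is missing, which is expected on architectures whose /proc/cpuinfo has no such line.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: first cache_alignment value in /proc/cpuinfo,
   or -1 if the file cannot be opened or the field is absent. */
static int cpuinfo_cache_alignment(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return -1;

    char line[256];
    int value = -1;
    while (fgets(line, sizeof line, f)) {
        /* Whitespace in the format matches the tab before ':' too. */
        if (sscanf(line, "cache_alignment : %d", &value) == 1)
            break;
    }
    fclose(f);
    return value;
}
```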

edited Dec 27 '20 at 14:23

hugomg


answered Jun 02 '12 at 07:17

Francesquini


Perhaps you want to remove a useless use of cat. – maxschlepzig Oct 06 '19 at 17:57

score 9 · Answer 4 · answered Sep 02 '11 at 14:52

There's no completely portable way to get the cacheline size. But if you're on x86/64, you can call the cpuid instruction to get everything you need to know about the cache - including size, cacheline size, how many levels, etc...
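
One hedged way to do this with GCC or Clang is the `<cpuid.h>` header. The sketch below reads CPUID leaf 1, where EBX bits 15:8 give the CLFLUSH line size in units of 8 bytes; on current x86 parts this matches the cache line size, but treat that as an assumption to verify against the vendor manuals. The helper name is hypothetical, and the function returns 0 on non-x86 targets.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the cpuid instruction */
#endif

/* Hypothetical helper: CLFLUSH line size in bytes from CPUID leaf 1,
   or 0 if the leaf is unavailable or the target is not x86. */
static unsigned cpuid_line_size(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return ((ebx >> 8) & 0xff) * 8;   /* EBX[15:8] counts 8-byte units */
#endif
    return 0;
}
```

Per-level details such as cache size and associativity come from other CPUID leaves, which differ between vendors and are not covered here.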

http://softpixel.com/~cwright/programming/simd/cpuid.php

(scroll down a little bit, the page is about SIMD, but it has a section getting the cacheline.)

As for aligning your data structures, there's also no completely portable way to do it. GCC and VS10 have different ways to specify alignment of a struct. One way to "hack" it is to pad your struct with unused variables until it matches the alignment you want.
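
The padding "hack" can be sketched in plain C without any compiler extension. The struct name and the assumed line size of 64 are illustrative. Note that padding only fixes the size of each element; the start of the array still has to be aligned by some other means.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size */

/* Hypothetical element type padded by hand to fill one cache line. */
struct padded_slot {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* unused filler bytes */
};

static_assert(sizeof(struct padded_slot) == CACHE_LINE,
              "padding must bring the struct to exactly one line");
static_assert(offsetof(struct padded_slot, pad) == sizeof(long),
              "no hidden padding before the filler");
```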

To align your mallocs(), all the mainstream compilers also have aligned malloc functions for that purpose.
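
With a C11 library, aligned_alloc is one such function. The sketch below uses a hypothetical helper and an assumed line size of 64; it requests whole cache lines so that the size is a multiple of the alignment, which satisfies the original C11 wording. The memory is released with free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size */

/* Hypothetical helper: n cache lines starting on a line boundary.
   Returns NULL on failure; release the block with free(). */
static void *alloc_lines(size_t n)
{
    /* n * CACHE_LINE is always a multiple of the alignment. */
    return aligned_alloc(CACHE_LINE, n * CACHE_LINE);
}
```

MSVC does not provide aligned_alloc; its `_aligned_malloc` and `_aligned_free` pair fills the same role there.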

score 8 · Answer 5 · answered Sep 02 '11 at 09:56

posix_memalign or valloc can be used to align allocated memory to a cache line.
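
A short sketch of the posix_memalign route, with a hypothetical helper name and an assumed line size of 64. The function reports failure through its return value rather than errno, and the alignment must be a power of two that is also a multiple of sizeof(void *). When compiling with a strict -std=c11, a POSIX feature-test macro may be needed to see the declaration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size */

/* Hypothetical helper: array of longs whose first element sits on a
   cache-line boundary. Returns NULL on failure; release with free(). */
static long *alloc_aligned_longs(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, count * sizeof(long)) != 0)
        return NULL;
    return p;
}
```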

answered Sep 02 '11 at 09:56

MetallicPriest


I know this is your own question, but for future readers you could answer both parts of it :-) – Steve Jessop Sep 02 '11 at 10:10
Steve, do you know if memory mapped by mmap is aligned to a cache line? – MetallicPriest Sep 02 '11 at 10:34
I don't think it's guaranteed by POSIX, but I also wouldn't be in the least surprised if Linux always selects addresses that are page-aligned, never mind just cache-line aligned. POSIX says that if the caller specifies the first parameter (address hint), that has to be page-aligned, and the mapping itself is always a whole number of pages. That's strongly suggestive without actually guaranteeing anything. – Steve Jessop Sep 02 '11 at 10:45
Yes, `mmap` only works in terms of pages, and pages are always larger than cache lines. Even in some theoretical weird architecture, there are extremely good reasons why cache lines won't be larger than pages (caches are normally physically tagged, so one line can't be split across 2 virtual pages without extreme pain for the CPU designers). – Peter Cordes Mar 12 '18 at 04:29

score 3 · Answer 6 · answered Nov 29 '19 at 11:36

Here's a table I made that has most Arm/Intel processors on it. You can use it for reference when defining constants, that way you don't have to generalize the cache line size for all architectures.

For C++, hopefully, we will soon see hardware interference size (`std::hardware_destructive_interference_size`), which should be an accurate way to get this information (assuming you tell the compiler your target architecture).

answered Nov 29 '19 at 11:36

zoecarver


Compilers are reluctant to implement `hardware_destructive_interference_size` because you really want it to be a compile-time-constant, but it can't always be if you're compiling for a "generic" target that could run on multiple CPUs of the same ISA. A conservative choice would be possible but not guaranteed future-proof. (Like 128 bytes to account for current x86 CPU with 64-byte lines and an L2 spatial prefetch that likes to complete an aligned pair of lines. (mainstream Intel)) – Peter Cordes Nov 29 '19 at 13:34

score 2 · Answer 7 · answered Jan 23 '15 at 06:22

If anyone is curious about how to do this easily in C++, I've built a library with a CacheAligned<T> class which handles determining the cache line size as well as the alignment for your T object, referenced by calling .Ref() on your CacheAligned<T> object. You can also use Aligned<typename T, size_t Alignment> if you know the cache line size beforehand, or just want to stick with the very common value of 64 (bytes).

https://github.com/NickStrupat/Aligned

answered Jan 23 '15 at 06:22

Nick Strupat


@James - `alignas` is C++11. It's not available for C++03. And it won't work on a number of Apple platforms. On some of their OSes, Apple provides an ancient C++ Standard Library that pretends to be C++11, but lacks `unique_ptr`, `alignas`, etc. – jww Oct 13 '15 at 15:59
@James also, the standard only requires `alignas` to support up to 16 bytes, so any higher value won't be portable. And since virtually all modern processors have a cache line size of 64 bytes, `alignas` isn't useful unless you know your compiler supports `alignas(64)`. – Nick Strupat Apr 20 '16 at 06:09
`alignas` is also in C11, not just C++11. – Alnitak Nov 14 '18 at 15:39
`alignas` officially only supports alignment up till the size of the type `std::max_align_t`, which is typically the alignment requirement of a `long double`, aka 8 or 16 bytes - not 64 unfortunately. – Carlo Wood Jul 20 '19 at 15:41
@NickStrupat It seems that support for alignment to cache line sizes has finally been added in C++17. My last comment also seems to be no longer correct for C++17 (the problem was merely that operator new was not guaranteed to return memory aligned better than std::max_align_t). I just found this: https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size – Carlo Wood Jul 20 '19 at 16:14
@CarloWood You're right about the C++17 addition. The only advantage remaining for my library and its underlying `get_cachline_size` function is that it can retrieve that information at run-time. The downside is that you lose possible compiler optimizations if the cache line size is known at compile time. – Nick Strupat Jul 20 '19 at 16:41
@NickStrupat After posting this comment, I tried it out and discovered that neither gcc nor clang support it... Apparently they went for option 3 in http://lists.llvm.org/pipermail/cfe-dev/2018-May/058138.html (I read the whole thread; it's long but to summarize -- they have no clue how to implement it and were thinking about filing a Defect Report). Nevertheless, your library will of course have the exact same ABI/ODR issues. I'm starting to feel that simply using 64 bytes everywhere for now is my best option :/. – Carlo Wood Jul 20 '19 at 17:06
