64

Is there a way in C++ to determine the CPU's cache size? I have an algorithm that processes a lot of data and I'd like to break this data down into chunks such that they fit into the cache. Is this possible? Can you give me any other hints on programming with cache size in mind (especially in regard to multithreaded/multicore data processing)?

Thanks!

Ian Ringrose
Mat
  • My experiments with blocked algorithms show that blocking can interfere with GCC's optimizations. The optimal block size is not always the L1 cache size; I suggest testing different block sizes. – Dr. Jekyll May 05 '17 at 16:34

10 Answers

17

According to "What Every Programmer Should Know About Memory" by Ulrich Drepper, you can do the following on Linux:

Once we have a formula for the memory requirement we can compare it with the cache size. As mentioned before, the cache might be shared with multiple other cores. Currently {There definitely will sometime soon be a better way!} the only way to get correct information without hardcoding knowledge is through the /sys filesystem. In Table 5.2 we have seen what the kernel publishes about the hardware. A program has to find the directory:

/sys/devices/system/cpu/cpu*/cache

This is listed in Section 6: What Programmers Can Do.

He also describes a short test right under Figure 6.5 which can be used to determine L1D cache size if you can't get it from the OS.

There is one more thing I ran across in his paper: sysconf(_SC_LEVEL2_CACHE_SIZE) is a glibc call on Linux (not a system call) which is supposed to return the L2 cache size, although it doesn't seem to be well documented.

Robert S. Barnes
12

C++ itself doesn't "care" about CPU caches, so there's no support for querying cache sizes built into the language. If you are developing for Windows, there's the GetLogicalProcessorInformation() function, which can be used to query information about the CPU caches.

kusma
    ? Of course C++ cares about CPU caches. Not caring about that would mean being like 100-1000 times slower than you could be in hot paths. And no support for that is built into the language because number one not all systems support it and number two knowing the cache size is mostly completely irrelevant without also having full control over memory allocation strategies. C++ gives you a lot but not enough to play "cache daddy" – thesaint Jun 30 '15 at 20:40
  • If anything, your C/C++ **compiler** cares about caching. You can use the `-march` flag on GCC to let it optimize for some specific CPU. There's also this: http://en.cppreference.com/w/cpp/atomic/memory_order – snowflake Feb 25 '18 at 10:42
9

Preallocate a large array, then access each element sequentially and record the time for each access. Ideally there will be a jump in access time when a cache miss occurs, from which you can estimate your L1 cache size. It might not work, but it's worth trying.

ben
4

See this thread: http://software.intel.com/en-us/forums/topic/296674

The short answer is in this other thread:

On modern IA-32 hardware, the cache line size is 64. The value 128 is a legacy of the Intel Netburst Microarchitecture (e.g. Intel Pentium D) where 64-byte lines are paired into 128-byte sectors. When a line in a sector is fetched, the hardware automatically fetches the other line in the sector too. So from a false sharing perspective, the effective line size is 128 bytes on the Netburst processors. (http://software.intel.com/en-us/forums/topic/292721)

Daniel Munoz
4

Read the CPUID of the CPU (x86) and then determine the cache size via a look-up table. The table has to be filled with the cache sizes the CPU's manufacturer publishes in its programming manuals.

Tobias Langner
    hey that sounds interesting! are there any such precomposed tables available online? – Mat Dec 17 '09 at 14:55
  • http://www.x86-guide.com/en/index.html Might have such tables. However, the problem with this is what do you do with an unidentified cpu and do you want to have to update the program every time a new cpu comes out? – Robert S. Barnes Dec 24 '09 at 09:38
    Doesn't this solution break down though if your program is used on a CPU released after your program is released? – Billy ONeal Aug 29 '10 at 20:43
  • @Billy ONeal: Not necessarily. `CPUID` has a "sub-identification" feature that can be used to query cache sizes/associativities/line-sizes on every CPU that supports the query, see http://en.wikipedia.org/wiki/CPUID#EAX.3D80000006h:_Extended_L2_Cache_Features – FrankH. Aug 19 '12 at 08:13
  • @FrankH: Tobias' answer says use a lookup table, not use the sub-identification feature of CPUID. – Billy ONeal Aug 19 '12 at 23:11
4

Depending on what you're trying to do, you might also leave it to some library. Since you mention multicore processing, you might want to have a look at Intel Threading Building Blocks.

TBB includes cache aware memory allocators. More specifically, check cache_aligned_allocator (in the reference documentation, I couldn't find any direct link).

Stéphane Bonniez
4

Interestingly enough, I wrote a program to do this a while ago (in C, though I'm sure it would be easy to incorporate into C++ code).

http://github.com/wowus/CacheLineDetection/blob/master/Cache%20Line%20Detection/cache.c

The get_cache_line function is the interesting one: it returns the location right before the biggest spike in the timing data of the array accesses. It guessed correctly on my machine! If nothing else, it can help you make your own.

It's based on this article, which originally piqued my interest: http://igoro.com/archive/gallery-of-processor-cache-effects/

Clark Gaebel
2

There are two cases that need to be distinguished. Do you need to know the cache sizes at compile time or at runtime?

Determining the cache-size at compile-time

For some applications, you know the exact architecture that your code will run on, for example, if you can compile the code directly on the host machine. In that case, simply looking up the size and hard-coding it is an option (this could be automated in the build system). On most machines today, the L1 cache line is 64 bytes.

If you want to avoid that complexity, or if you need to support compilation for unknown architectures, you can use the C++17 constants std::hardware_destructive_interference_size and std::hardware_constructive_interference_size as a fallback. They provide a compile-time estimate of the cache-line size, but be aware of their limitations: the compiler cannot guess perfectly when it creates the binary, since the actual cache-line size is, in general, architecture-dependent.

Determining the cache-size at runtime

At runtime, you have the advantage that you know the exact machine, but you will need platform-specific code to read the information from the OS. A good starting point is the code snippet from this answer, which supports the major platforms (Windows, Linux, macOS). In a similar fashion, you can also read the L2 cache size at runtime.

I would advise against trying to guess the cache-line size by running benchmarks at startup and picking whichever performed best. It might well work, but it is also error-prone if the CPU is being used by other processes.

Combining both approaches

If you have to ship one binary and the machines that it will later run on feature a range of different architectures with varying cache sizes, you could create specialized code paths for each cache size and then dynamically (at application startup) choose the best-fitting one.

Philipp Claßen
1

IIRC, GCC has a __builtin_prefetch hint.

http://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Other-Builtins.html

has an excellent section on this. Basically, it suggests:

__builtin_prefetch (&array[i + LookAhead], rw, locality);

where rw is 0 (prepare for a read) or 1 (prepare for a write), and locality is a value from 0 to 3, where 0 means no temporal locality and 3 means very strong locality.

Both are optional arguments. LookAhead is the number of elements to look ahead. If a memory access takes 100 cycles and the unrolled loop iterations are two cycles apart, LookAhead could be set to 50 or 51.

Max
-1

The cache will usually do the right thing. The only real worry for the typical programmer is false sharing, and you can't take care of that at runtime: it requires decisions made when the code is built, such as alignment directives and padding.