40

As a school assignment, I need to find a way to get the L1 data cache line size, without reading config files or making API calls. I'm supposed to use the timings of memory read/write accesses to analyze and derive this info. So how might I do that?

In an incomplete attempt at another part of the assignment, finding the levels and sizes of the cache, I have:

for (i = 0; i < steps; i++) {
    arr[(i * 4) & lengthMod]++; // stride of 4 ints; lengthMod = array length - 1 (a power-of-2 mask)
}

I was thinking maybe I just need to vary line 2, the (i * 4) part? Once I exceed the cache line size, I might need to replace the line, which takes some time? But is it that straightforward? The required block might already be in the cache somewhere? Or perhaps I can still count on the fact that if I have a large enough steps, it will still work out quite accurately?

UPDATE

Here's an attempt on GitHub ... main part below:

// repeatedly access/modify data, varying the STRIDE
for (int s = 4; s <= MAX_STRIDE / sizeof(int); s *= 2) {
    start = wall_clock_time();
    for (unsigned int k = 0; k < REPS; k++) {
        data[(k * s) & lengthMod]++;   // lengthMod masks the index into the array
    }
    end = wall_clock_time();
    timeTaken = ((float)(end - start)) / 1000000000;   // nanoseconds -> seconds
    printf("%d, %1.2f \n", (int)(s * sizeof(int)), timeTaken);   // stride in bytes, seconds
}

Problem is, there doesn't seem to be much difference between the timings. FYI, since it's for the L1 cache, I have SIZE = 32 K (the size of the array).

Jiew Meng
  • The C tag has been added - @JiewMeng, perhaps you would confirm that you are indeed writing in C. I've removed the homework tag (as per http://meta.stackexchange.com/questions/147100/the-homework-tag-is-now-officially-deprecated?cb=1) – Dan Puzey Oct 01 '12 at 14:21
  • @DanPuzey, yes, it's C or C++ ... – Jiew Meng Oct 01 '12 at 14:24
  • Google 'cache benchmarking', do some research. – High Performance Mark Oct 01 '12 at 14:27
  • You can use assembly and then the CPUID instruction (it's a processor instruction, not an API) to get this information. I know you are probably not looking for a solution like this one, but I think it's worth sharing anyway... – Hugo Corrá Oct 05 '12 at 14:31
  • [This question](http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops) might give you some ideas. It doesn't measure the cache sizes, but it does show significant performance drops at each cache level. – Mysticial Oct 12 '12 at 00:04
  • A less reverse engineering focused question: http://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size – Ciro Santilli OurBigBook.com Mar 16 '17 at 07:56
  • Semi-related: http://igoro.com/archive/gallery-of-processor-cache-effects/ has lots of interesting stuff about cache effects. – Peter Cordes May 12 '18 at 04:48

8 Answers

30

Allocate a BIG char array (make sure it is too big to fit in L1 or L2 cache). Fill it with random data.

Start walking over the array in steps of n bytes. Do something with the retrieved bytes, like summing them.

Benchmark and calculate how many bytes/second you can process with different values of n, starting from 1 and counting up to 1000 or so. Make sure that your benchmark prints out the calculated sum, so the compiler can't possibly optimize the benchmarked code away.

When n == your cache line size, each access will require reading a new line into the L1 cache. So the measured bytes/second should drop quite sharply at that point.

If the array is big enough, by the time you reach the end, the data at the beginning of the array will already be out of cache again, which is what you want. So after you increment n and start again, the results will not be affected by having needed data already in the cache.
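Here is a minimal sketch of that recipe, assuming a POSIX clock_gettime() timer; the buffer size, access count, and helper names are my own illustrative choices, not code from the question:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (128 * 1024 * 1024)   /* 128 MB: far too big for L1 or L2 */

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void) {
    unsigned char *buf = malloc(BUF_SIZE);
    for (size_t i = 0; i < BUF_SIZE; i++)
        buf[i] = (unsigned char)rand();        /* fill with random data */

    /* Same number of accesses for every stride, chosen so that even the
       largest stride never runs off the end of the buffer. */
    const size_t accesses = BUF_SIZE / 1024;
    for (size_t n = 1; n <= 1024; n *= 2) {
        unsigned long sum = 0;
        long long start = now_ns();
        for (size_t i = 0, off = 0; i < accesses; i++, off += n)
            sum += buf[off];                   /* touch one byte every n bytes */
        long long elapsed = now_ns() - start;
        /* printing sum keeps the compiler from deleting the loop */
        printf("stride %4zu: %8lld ns (sum %lu)\n", n, elapsed, sum);
    }
    free(buf);
    return 0;
}

The time per run should jump once n reaches the line size, since every access then misses into a fresh line.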

Alex D
  • Probably HW prefetching will figure out steps of 'n', and will load ahead of you. – auselen Oct 01 '12 at 21:21
  • @auselen, is it possible to disable HW prefetching temporarily? – Jiew Meng Oct 01 '12 at 22:57
  • Can I sum `char`? Or do you mean declare an int array? `int` will be 4 bytes though? – Jiew Meng Oct 02 '12 at 04:46
  • I think this idea should work; however, try to take the n steps in a random way to avoid prefetching, something like n + (r * c), where c is a power-of-2 value bigger than the possible cache line size and r is a random value. You need to be sure that n + (r * c) is within your array, probably using modulo. – auselen Oct 02 '12 at 05:42
  • I guess it is also fair to make some assumptions, like: the cache line size must be a power of 2, at least 32 bytes, at most 512 bytes. – auselen Oct 02 '12 at 05:43
  • @auselen -- I also thought about the effects of prefetching before writing this answer, but I still recommend that the OP try this method first, and see how the results work out. The inner loop of this benchmark should compile down to ~4 instructions (~3 if he uses sentinels). If you have 1 cache miss **every 4 instructions**, I don't think even prefetching will help you much. Other readers who know CPU hardware better than me are free to correct me here! – Alex D Oct 02 '12 at 08:38
  • @JiewMeng, sure, you can sum `char`s. Why not? The sum can be either an `int` or a `char`. – Alex D Oct 02 '12 at 08:40
  • Hmm ... here's **[my attempt on GitHub](https://github.com/jiewmeng/cs3210-assign1/blob/master/cache-l1-line.cpp)**. The results don't show much difference, with all times being ~0.5 s – Jiew Meng Oct 05 '12 at 10:37
  • Your `data` array is only 32KB, so the whole thing will fit in L1 cache. Please pay attention to the first thing I said above: "Allocate a BIG char array. Make sure it is too big to fit in L1 or L2 cache." – Alex D Oct 05 '12 at 19:21
  • I modified your code a bit and ran it on my system. Results: 4, 0.02 8, 0.01 16, 0.03 32, 0.10 64, 0.23 128, 0.30 256, 0.31 512, 0.27. The biggest jump seems to be from 32 bytes to 64 bytes, so I guessed that my cache line size is probably 64 bytes. An Internet search confirmed this is true! – Alex D Oct 05 '12 at 19:23
  • @JiewMeng, I'm not posting my code right now, because you will learn more by figuring out how to fix your code yourself. After you figure it out, I can send you my code for comparison. – Alex D Oct 06 '12 at 04:20
  • At first I was also thinking of allocating a large array; however, the question asked for the L1 cache line size, not L2/3, so why do I allocate a large array? Won't it not fit in the cache, and thus use L2/3 instead? Hmm, or maybe it will partially use L1? – Jiew Meng Oct 06 '12 at 11:49
  • @JiewMeng, the experiment you already did shows why you need to allocate a large array. If the whole array fits in L1 cache, it will all get loaded into cache on the first iteration of the outer loop, and on subsequent iterations, there will be no misses. That's why your results were all the same. The whole point of this exercise is to find the stride at which *every* iteration of the inner loop will result in an L1 cache miss. To do that, you need to make the array big enough that by the time you reach the end, the beginning part has already been displaced from the cache. – Alex D Oct 06 '12 at 16:49
  • BTW, your "modulo" code is not needed and will dilute the results (depending on the modulo value, it may even render the test ineffective). Just make the array big enough that for each run of the inner loop, you can stride over it linearly without running off the end. – Alex D Oct 06 '12 at 16:52
  • Hmm, but I need that to keep the references within the bounds of the data array? Else I get a segmentation fault – Jiew Meng Oct 07 '12 at 00:30
  • I updated the source on GitHub again. The results still don't seem clear. I do understand that if `MAX_STRIDE` gets too large, there's some chance the `&` will cause a reference to a memory location still in the cache, but I don't think that's happening ... my SIZE (MB) is very large compared to MAX_STRIDE (in bytes) – Jiew Meng Oct 07 '12 at 02:37
  • I used 128MB for SIZE, and made REPS small enough that it could never run off the end and get a segmentation fault, even with the maximum stride. – Alex D Oct 07 '12 at 12:58
  • Note that if the test works as expected, you should get a jump in the measured time when the stride *equals*, not exceeds, the cache line size. – Alex D Oct 07 '12 at 12:59
  • I also ran `gcc` with optimizations on (`-O5`), which might tighten your inner loop, thus making the effect of cache misses more obvious. Eliminating your modulo will also tighten up the inner loop. You don't need the multiply in the inner loop (can you see how to achieve the same effect without it?), but hopefully the compiler will optimize that for you. – Alex D Oct 07 '12 at 13:01
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/17671/discussion-between-jiew-meng-and-alex-d) – Jiew Meng Oct 07 '12 at 13:46
  • @JiewMeng, I haven't been able to catch you in the chat room. I suggest you send me your e-mail address. – Alex D Oct 08 '12 at 07:57
  • I don't think this will work. You'll find the critical stride with this method, but you still need the number of lines per set to calculate the line size. – Luchian Grigore Oct 10 '12 at 20:53
  • @LuchianGrigore, this idea has nothing to do with the critical stride. The idea is this: if you walk through memory in steps smaller than the cache line size, then after a cache miss, the next read is guaranteed to be a cache hit. Say each line is 32 bytes, and you are striding through memory in steps of 10 bytes. You access the first byte of a 32-byte region, which causes the whole 32 bytes to be read into cache. Now the next 2 reads are both guaranteed cache hits. If you stride in units of the line size or bigger, each read will land on a different line, so you won't get those cache hits. – Alex D Oct 12 '12 at 19:43
  • @LuchianGrigore: to continue -- I advised the OP to use an array larger than L1 *or* L2 cache. By repeatedly striding over a *BIG* array, the OP can guarantee that once a cache line is passed by, it is *guaranteed* to be removed from cache by the next iteration of the outer loop (regardless of issues of set-associativity, critical stride, and so on). – Alex D Oct 12 '12 at 19:48
  • @LuchianGrigore: the stride isn't to create conflict misses, it's to see that a stride twice the cache line size only touches half the number of cache lines, while a stride smaller than the cache line size benefits from spatial locality. HW prefetch could confound this some, especially if adjacent-line prefetch wastes bandwidth filling the lines you skip with stride=128B on a system with 64B lines. (On Intel CPUs, HW prefetch is supposed to throttle when there are lots of demand misses outstanding, so this should still work if you get the compiler to produce the right asm...) – Peter Cordes May 12 '18 at 04:55
6

Have a look at Calibrator. All of the work is copyrighted, but the source code is freely available. The idea in its documentation for calculating cache line sizes sounds much more educated than what's already been said here.

The idea underlying our calibrator tool is to have a micro benchmark whose performance only depends on the frequency of cache misses that occur. Our calibrator is a simple C program, mainly a small loop that executes a million memory reads. By changing the stride (i.e., the offset between two subsequent memory accesses) and the size of the memory area, we force varying cache miss rates.

In principle, the occurrence of cache misses is determined by the array size. Array sizes that fit into the L1 cache do not generate any cache misses once the data is loaded into the cache. Analogously, arrays that exceed the L1 cache size but still fit into L2 will cause L1 misses but no L2 misses. Finally, arrays larger than L2 cause both L1 and L2 misses.

The frequency of cache misses depends on the access stride and the cache line size. With strides equal to or larger than the cache line size, a cache miss occurs with every iteration. With strides smaller than the cache line size, a cache miss occurs only every n iterations (on average), where n is the ratio cache line size/stride.

Thus, we can calculate the latency for a cache miss by comparing the execution time without misses to the execution time with exactly one miss per iteration. This approach only works if memory accesses are executed purely sequentially, i.e., we have to ensure that neither two or more load instructions, nor memory access and pure CPU work, can overlap. We use a simple pointer chasing mechanism to achieve this: the memory area we access is initialized such that each load returns the address for the subsequent load in the next iteration. Thus, super-scalar CPUs cannot benefit from their ability to hide memory access latency by speculative execution.

To measure the cache characteristics, we run our experiment several times, varying the stride and the array size. We make sure that the stride varies at least between 4 bytes and twice the maximal expected cache line size, and that the array size varies from half the minimal expected cache size to at least ten times the maximal expected cache size.
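For illustration, here is a minimal sketch of that pointer-chasing mechanism (my own code, not Calibrator's): each element stores the index of the next element to load, so every load depends on the previous one and accesses cannot overlap.

#include <stddef.h>

/* Build a cyclic chain through buf with the given stride (in elements),
   then chase it. Because each load's address comes from the previous
   load, the CPU cannot overlap or reorder the memory accesses. */
static size_t chase(size_t *buf, size_t len, size_t stride, size_t reps) {
    for (size_t i = 0; i < len; i++)
        buf[i] = (i + stride) % len;   /* index of the next element to load */
    size_t idx = 0;
    while (reps--)
        idx = buf[idx];                /* serialized dependent loads */
    return idx;                        /* returned so the loop isn't optimized out */
}

Timing chase() over varying stride and len values then yields the miss-rate patterns described in the quote.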

I had to comment out #include "math.h" to get it to compile; after that, it found my laptop's cache values correctly. I also couldn't view the generated PostScript files.

auselen
  • For my machine (Haswell), Calibrator predicts the line size incorrectly, and @AlexD's approach also doesn't work. The problem is the prefetcher, which manages to guess constant-stride patterns and spoof the experiment. I suppose this could be measured with the prefetcher disabled – Eli Bendersky Sep 27 '15 at 21:17
5

You can use the CPUID instruction in assembler; although non-portable, it will give you what you want.

For Intel Microprocessors, the Cache Line Size can be calculated by multiplying bh by 8 after calling cpuid function 0x1.

For AMD Microprocessors, the data Cache Line Size is in cl and the instruction Cache Line Size is in dl after calling cpuid function 0x80000005.

I took this from this article here.
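For reference, a sketch of those two leaves using GCC/Clang's <cpuid.h> wrapper instead of raw assembly (a convenience I am assuming here; the article itself uses bare CPUID):

#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang x86 header providing __get_cpuid() */

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Intel: leaf 0x1, EBX bits 15:8 (i.e. bh) = line size in 8-byte units */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("line size (leaf 0x1): %u bytes\n", ((ebx >> 8) & 0xFF) * 8);

    /* AMD: leaf 0x80000005, cl = L1 data line size, dl = L1 instruction line size */
    if (__get_cpuid(0x80000005, &eax, &ebx, &ecx, &edx)) {
        printf("L1d line size: %u bytes\n", ecx & 0xFF);
        printf("L1i line size: %u bytes\n", edx & 0xFF);
    }
    return 0;
}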

Tony The Lion
2

I think you should write a program that walks through the array in random order instead of straight through, because modern processors do hardware prefetching. For example, make an array of int whose values are the numbers of the next cells to visit. I wrote a similar program a year ago: http://pastebin.com/9mFScs9Z Sorry for my English, I am not a native speaker.
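A minimal sketch of that idea (my own illustration, not the linked pastebin code): link the cells into one random cycle, so that chasing the chain visits the array in an order the hardware prefetcher cannot predict.

#include <stdlib.h>

/* Link buf[0..len) into a single random cycle: buf[i] holds the index
   of the next cell to visit. */
static void build_random_chain(size_t *buf, size_t len) {
    size_t *order = malloc(len * sizeof *order);
    for (size_t i = 0; i < len; i++)
        order[i] = i;
    for (size_t i = len - 1; i > 0; i--) {     /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < len; i++)
        buf[order[i]] = order[(i + 1) % len];  /* close the cycle */
    free(order);
}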

Alexey Matveev
1

See how memtest86 is implemented. It measures and analyzes the data transfer rate. Points where the rate changes correspond to the sizes of the L1, L2, and possibly L3 caches.

vitaly.v.ch
  • memory bandwidth dropoffs for larger arrays can tell you the total size of L1d / L2 / L3, but this question is asking about the size of each line, i.e. the cache block size. – Peter Cordes May 12 '18 at 04:46
1

If you get stuck in the mud and can't get out, look here.

There are manuals and code that explain how to do what you're asking. The code is pretty high quality as well. Look at "Subroutine library".

The code and manuals are based on x86 processors.

JimR
1

Just a note.

The cache line size is variable on a few ARM Cortex families and can change during execution without any notification to the running program.

vitaly.v.ch
0

I think it should be enough to time an operation that uses some amount of memory, then progressively increase the memory (operands, for instance) used by the operation. When the operation's performance severely decreases, you have found the limit.

I would go with just reading a bunch of bytes without printing them (printing would hit the performance so badly that it would become a bottleneck). While reading, the timing should be directly proportional to the amount of bytes read, until the data cannot fit in L1 anymore; then you will take the performance hit.

You should also allocate the memory once, at the start of the program and before you start timing.
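A rough sketch of that size sweep, assuming a plain clock() timer (illustrative only; note Alex D's comment below that the assignment actually asks for the line size, not the cache size):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    enum { MAX = 8 * 1024 * 1024 };
    const size_t TOTAL = 256 * 1024 * 1024;  /* same total work for every size */
    unsigned char *buf = calloc(MAX, 1);     /* allocated once, before timing */

    for (size_t size = 4 * 1024; size <= MAX; size *= 2) {
        unsigned long sum = 0;
        clock_t start = clock();             /* time only the reads */
        for (size_t done = 0; done < TOTAL; done += size)
            for (size_t i = 0; i < size; i++)
                sum += buf[i];
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        /* printing sum keeps the reads from being optimized away */
        printf("%8zu bytes: %6.0f MB/s (sum %lu)\n",
               size, TOTAL / secs / 1e6, sum);
    }
    free(buf);
    return 0;
}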

enTropy
  • His assignment is not to find the size of the L1 cache, but the size of *one cache line* in L1. – Alex D Oct 01 '12 at 14:49