In a direct mapped cache of 32 bytes with 16-byte blocks you will have 2 blocks. This means that for a given address you will have 1 bit to identify the block index, 4 bits for the block offset used to identify the particular byte inside the block (2 for word offset and 2 for byte offset), and the rest will be the tag.
The cache would look like this:
Valid? | Tag | Index | Data
-------+-----+-------+-----
| | 0 | (16 bytes)
-------+-----+-------+-----
| | 1 | (16 bytes)
-------+-----+-------+-----
Let's assume 8 bit addresses for simplicity (with longer addresses you only get a longer tag, nothing else changes). In case of 8 bit addresses you would have 3 bits for the tag, 1 for the block index and 4 for the block offset. So, given an address, e.g. 0x34
, you can decompose it like so:
block offset
tag |
001|1|0100
|
block index
Now assume that the two arrays are in memory one right after the other, like this:
x[0] | x[1] | ... | x[7] | y[0] | y[1] | ... | y[7]
If x
starts at address 0x0, we would have the following situation:
element address element address
x[0] 000|0|0000 (0) y[0] 001|0|0000 (32)
x[1] 000|0|0100 (4) y[1] 001|0|0100 (36)
x[2] 000|0|1000 (8) y[2] 001|0|1000 (40)
x[3] 000|0|1100 (12) y[3] 001|0|1100 (44)
x[4] 000|1|0000 (16) y[4] 001|1|0000 (48)
x[5] 000|1|0100 (20) y[5] 001|1|0100 (52)
x[6] 000|1|1000 (24) y[6] 001|1|1000 (56)
x[7] 000|1|1100 (28) y[7] 001|1|1100 (60)
| |
block index block index
As you can see, the problem here is that the block index is always the same between any two elements of x
and y
with the same array index. This means a cache miss will happen for every single array access, since you first access x[i]
and then y[i]
(or the opposite), and each time you have values for the wrong array in the cache block.
Now suppose you add the appropriate padding after the end of x
:
x[0] | x[1] | ... | x[7] | ...PADDING... | y[0] | y[1] | ... | y[7]
You are now in a much better situation:
element addr element addr
x[0] 000|0|0000 (0) y[0] 01|1|0000 (48)
x[1] 000|0|0100 (4) y[1] 01|1|0100 (52)
x[2] 000|0|1000 (8) y[2] 01|1|1000 (56)
x[3] 000|0|1100 (12) y[3] 01|1|1100 (60)
x[4] 000|1|0000 (16) y[4] 10|0|0000 (64)
x[5] 000|1|0100 (20) y[5] 10|0|0100 (68)
x[6] 000|1|1000 (24) y[6] 10|0|1000 (72)
x[7] 000|1|1100 (28) y[7] 10|0|1100 (76)
| |
block index block index
Indeed, this situation is optimal, you will only have 4 cache misses (x[0]
, y[0]
, x[4]
, y[4]
).
Why wouldn't the other options work? Well, let's see.
=== WITH x[9] =======================================
element address element address
x[0] 000|0|0000 (0) y[0] 001|0|0100 (36)
x[1] 000|0|0100 (4) y[1] 001|0|1000 (40)
x[2] 000|0|1000 (8) y[2] 001|0|1100 (44)
x[3] 000|0|1100 (12) y[3] 001|1|0000 (48)
x[4] 000|1|0000 (16) y[4] 001|1|0100 (52)
x[5] 000|1|0100 (20) y[5] 001|1|1000 (56)
x[6] 000|1|1000 (24) y[6] 001|1|1100 (60)
x[7] 000|1|1100 (28) y[7] 010|0|0000 (64)
=== WITH x[10] ======================================
element address element address
x[0] 000|0|0000 (0) y[0] 001|0|1000 (40)
x[1] 000|0|0100 (4) y[1] 001|0|1100 (44)
x[2] 000|0|1000 (8) y[2] 001|1|0000 (48)
x[3] 000|0|1100 (12) y[3] 001|1|0100 (52)
x[4] 000|1|0000 (16) y[4] 001|1|1000 (56)
x[5] 000|1|0100 (20) y[5] 001|1|1100 (60)
x[6] 000|1|1000 (24) y[6] 010|0|0000 (64)
x[7] 000|1|1100 (28) y[7] 010|0|0100 (68)
=== WITH x[11] =====================================
element address element address
x[0] 000|0|0000 (0) y[0] 001|0|1100 (44)
x[1] 000|0|0100 (4) y[1] 001|1|0000 (48)
x[2] 000|0|1000 (8) y[2] 001|1|0100 (52)
x[3] 000|0|1100 (12) y[3] 001|1|1000 (56)
x[4] 000|1|0000 (16) y[4] 001|1|1100 (60)
x[5] 000|1|0100 (20) y[5] 010|0|0000 (64)
x[6] 000|1|1000 (24) y[6] 010|0|0100 (68)
x[7] 000|1|1100 (28) y[7] 010|0|1000 (72)
=== WITH x[16] =====================================
element address element address
x[0] 000|0|0000 (0) y[0] 010|0|0000 (64)
x[1] 000|0|0100 (4) y[1] 010|0|0100 (68)
x[2] 000|0|1000 (8) y[2] 010|0|1000 (72)
x[3] 000|0|1100 (12) y[3] 010|0|1100 (76)
x[4] 000|1|0000 (16) y[4] 010|1|0000 (80)
x[5] 000|1|0100 (20) y[5] 010|1|0100 (84)
x[6] 000|1|1000 (24) y[6] 010|1|1000 (88)
x[7] 000|1|1100 (28) y[7] 010|1|1100 (92)
As you can easily tell from the above, none of those choices of padding are optimal. x[16]
is basically identical to the worst case, you have too much padding. x[9]
only solves the problem for 1/4 of the elements, x[10]
for 2/4 of the elements and x[11]
for 3/4 of the elements.
So, exactly as you say, those options give varying changes in the percentage of cache misses present in the code, but none of them is optimal (i.e. lowest miss rate possible).
The only best configuration is one where you have any padding such that x
starts with block index 0
and y
starts with block index 1
, then x[4]
is at block index 1 and y[4]
is at block index 0. Either this, or the complete opposite. So good values for the size of array x
are 8 * N + 4
for any N >= 1, with the smallest being 12
.