
People told me that adding padding can improve performance because it uses the cache more effectively.

I don't understand how making your data bigger can give you better performance.

Can someone explain why?

Fractale
  • Your memory controller can access aligned values (8 bytes on a 64-bit system, 4 bytes on a 32-bit system) faster than unaligned ones (one load instead of two). But your compiler already does this trick, so if you program in a high-level language like C++ you don't have to care. – user6556709 Mar 06 '19 at 09:42
  • @user6556709 Is there any other case where you need to add padding for performance? If I have a C array of type A, which is two bytes, will array[1] be unaligned? – Fractale Mar 06 '19 at 09:51
  • 1
    have a look at https://books.google.ch/books?id=3haKUwK2U5oC&pg=PA133&lpg=PA133&dq=padding+matrix+performance&source=bl&ots=r_4YyTd24D&sig=ACfU3U1pDHXhSHchSZFeDZVO2d2GdHUAVQ&hl=de&sa=X&ved=2ahUKEwjbv8jUnu3gAhXlzaYKHWejBMIQ6AEwCnoECAIQAQ – Walter Kuhn Mar 06 '19 at 09:55
  • @Fractale Arrays with elements smaller than one word (8 or 4 bytes on your system) will never all be aligned. But for a stream of data your compiler will only generate word-wise loads. On AMD64/x86-64, access to the lower half of the registers is byte-wise; for the upper half it is 4-byte-wise. You should test whether that really matters before trying to optimize something. – user6556709 Mar 06 '19 at 10:03

2 Answers


Array padding

The array padding technique consists of increasing the size of the array dimensions in order to reduce conflict misses when accessing a cache memory. This type of miss can occur when the number of accessed elements mapping to the same set is greater than the degree of associativity of the cache. Padding changes the data layout and can be applied (1) between variables (Inter-Variable Padding) or (2) to a variable (Intra-Variable Padding):

1. Inter-Variable Padding

#define LEN 8192  /* number of elements per array */
#define P   16    /* padding elements inserted between x and y */

float x[LEN], padding[P], y[LEN];

float redsum() {
    float s = 0;

    for (int i = 0; i < LEN; i++)
        s = s + x[i] + y[i];

    return s;
}

If we have a direct mapped cache and the elements x[i] and y[i] are mapped into the same set, accesses to x will evict a block from y and vice versa, resulting in a high miss rate and low performance.

2. Intra-Variable Padding

#define LEN 1024  /* matrix dimension */
#define PAD 16    /* extra elements per row of x */

float x[LEN][LEN+PAD], y[LEN][LEN];

void symmetrize() {
    for (int i = 0; i < LEN; i++) {
        for (int j = 0; j < LEN; j++)
            y[i][j] = 0.5 * (x[i][j] + x[j][i]);
    }
}

In this case, if the elements of a column map to a small number of sets, the sequence of accesses can lead to conflict misses, so spatial locality is not exploited.

For example, suppose that during the first iteration of the outer loop, the block containing x[0][0] x[0][1] ... x[0][15] is evicted to store the block containing the element x[k][0]. Then, at the start of the second iteration, the reference to x[0][1] would cause a cache miss.

This technical document analyses the performance of the Fast Fourier Transform (FFT) as a function of the size of the matrix used in the calculations:

https://www.intel.com/content/www/us/en/developer/articles/technical/fft-length-and-layout-advisor.html

References

Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses. PLDI 1998. DOI: https://doi.org/10.1145/277650.277661

Changwan Hong et al. Effective padding of multidimensional arrays to avoid cache conflict misses. PLDI 2016. DOI: https://doi.org/10.1145/2908080.2908123

chus

I don't think it would matter in a simple loop. Have a look at this answer: Does alignment really matter for performance in C++11?

The most interesting bit for you from that answer is probably that you could arrange your classes so that members used together are in one cache line and those used by different threads are not.

Pan P
  • I think you found what I was looking for. Why would you want different threads to access different cache lines? – Fractale Mar 07 '19 at 02:32
  • 1
    When you have two member variables that are close to each other they might fit in one cache line. And then when you have two threads, each trying to access one of those variables, they will not be able to do it at the same time. This is called "false sharing" - in theory those variables should be accessible simultaneously by different threads, but because they get loaded into one cache line, then threads have to wait for each other to be done with that memory. – Pan P Mar 07 '19 at 11:11