
I traverse a block of data, perform a scale operation on each value, and record the elapsed time t1, as shown in A.cpp below.

In a second variant, before performing the same operation, I allocate a large heap buffer (50M longs, far larger than L1 and L2), then simply traverse it once in order to flush L1 and L2, and record the elapsed time t2, as shown in B.cpp below.

Comparing the two measurements, t2 is significantly smaller than t1, i.e. B.cpp performs better. (The same happens on both x86 and ARM.)

// A.cpp
#include <chrono>
#include <cstddef>

// Scale each group of 8 floats by alphaZ and add biasZ.
void ScaleAndAddBias(float* dstZ, const float* srcZ, const float* biasZ, const float* alphaZ, size_t planeNumber) {
    for (size_t p = 0; p < planeNumber; ++p) {
        float* dstX       = dstZ + 8 * p;
        const float* srcX = srcZ + 8 * p;
        for (int i = 0; i < 8; i++) {
            dstX[i] = srcX[i] * alphaZ[i] + biasZ[i];
        }
    }
}

int main() {
    auto a = new float[8 * 1 * 8192];
    auto b = new float[8 * 1 * 8192];
    auto c = new float[8 * 1];
    auto d = new float[8 * 1];

    auto begin = std::chrono::steady_clock::now();
    ScaleAndAddBias(a, b, c, d, 8192);
    auto end = std::chrono::steady_clock::now();
    auto t1 = end - begin;  // elapsed time for the scale pass
}
// B.cpp
#include <chrono>
#include <cstddef>

// Same kernel as in A.cpp.
void ScaleAndAddBias(float* dstZ, const float* srcZ, const float* biasZ, const float* alphaZ, size_t planeNumber) {
    for (size_t p = 0; p < planeNumber; ++p) {
        float* dstX       = dstZ + 8 * p;
        const float* srcX = srcZ + 8 * p;
        for (int i = 0; i < 8; i++) {
            dstX[i] = srcX[i] * alphaZ[i] + biasZ[i];
        }
    }
}

int main() {
    // Warm-up: allocate and traverse a buffer much larger than L1/L2,
    // so that afterwards the caches no longer hold anything useful.
    const size_t bigger_than_cachesize = 50 * 1024 * 1024;
    long* p = new long[bigger_than_cachesize];
    for (size_t j = 0; j < bigger_than_cachesize; j++) {
        p[j] += 1;
    }

    auto a = new float[8 * 1 * 8192];
    auto b = new float[8 * 1 * 8192];
    auto c = new float[8 * 1];
    auto d = new float[8 * 1];

    auto begin = std::chrono::steady_clock::now();
    ScaleAndAddBias(a, b, c, d, 8192);
    auto end = std::chrono::steady_clock::now();
    auto t2 = end - begin;  // elapsed time for the scale pass
}

Question: caches exploit spatial and temporal locality, but in B.cpp the memory traversed in advance is not the same region that is later scaled, and the two regions are not even contiguous. Why does traversing an unrelated buffer improve the access performance of the buffers used by ScaleAndAddBias?

  • Have you enabled compiler optimisations? What are the timing results you see? It's likely that the second test simply puts the CPU into high power mode, ready to execute the timed portion, whereas in the first test the CPU's "warming up" is included in the timing. What CPU are you using? – Alan Birtles Oct 14 '22 at 06:50
  • Thanks for your response. 1. I compile in release mode. 2. The timing only covers the scale process; the performance gap between t1 and t2 is especially noticeable on ARM (Cortex-A55). 3. "Puts the CPU in high power mode" might be a good explanation, but when I run A.cpp's scale 100 times, and in B.cpp run the warm-up once and then the scale 100 times, I get the same result. – zhiyujiang Oct 14 '22 at 07:07
  • What surprised you about this result? Your accesses can be easily predicted, so, besides the first few, they almost always hit the cache. In B you also prime the pages, so all the work of actually allocating them (see overcommitment) and translating them is not measured. – Margaret Bloom Oct 14 '22 at 08:22
  • Probably various warmup effects, including fault-around in the pagefault handler to wire up nearby pages, including some or all of the ones touched in the timed region. So maybe there are fewer page faults in the timed region. And likely page walks are cheaper as some higher levels of the page tables get into caches inside the page walkers. Also, if you're lucky, next-page prefetch might even prefetch the TLB. Also CPU frequency warm-up effects might be a factor. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Oct 17 '22 at 04:07
  • Hard to guess more without details on OS and CPU, and tuning settings like transparent hugepages. – Peter Cordes Oct 17 '22 at 04:08
  • @Peter Cordes The phenomenon occurs frequently. It originally showed up when I used MNN for model inference: traversing memory before each inference makes the inference faster. I abstracted the code above in the hope of reproducing the problem. In MNN's inference the memory relationships are more complex than here, and the effect doesn't seem to appear randomly; the performance increase is obvious on ARM. I'm curious why a pre-traversal of unrelated memory improves performance. I will look into the ideas you put forward; is there any other possible explanation for this phenomenon? – zhiyujiang Oct 17 '22 at 09:05
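One way to separate the hypotheses raised in the comments above (first-touch page faults, page-walk/TLB warm-up, CPU frequency ramp-up) is to pre-fault the kernel's own buffers and add an untimed warm-up call before the timed region, instead of traversing an unrelated 50M-long buffer. The following is only a minimal sketch along those lines, not part of the original question: the memset pre-fault, the warm-up call, the repetition count of 100, and the steady_clock timing are illustrative choices.

// prefault_test.cpp (hypothetical sketch)
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>

// Same kernel as in A.cpp / B.cpp above.
void ScaleAndAddBias(float* dstZ, const float* srcZ, const float* biasZ, const float* alphaZ, size_t planeNumber) {
    for (size_t p = 0; p < planeNumber; ++p) {
        float* dstX       = dstZ + 8 * p;
        const float* srcX = srcZ + 8 * p;
        for (int i = 0; i < 8; i++) {
            dstX[i] = srcX[i] * alphaZ[i] + biasZ[i];
        }
    }
}

int main() {
    const size_t planes = 8192;
    auto a = new float[8 * planes];
    auto b = new float[8 * planes];
    auto c = new float[8];
    auto d = new float[8];

    // Pre-fault the kernel's own buffers: every page is written once, so the
    // timed region should no longer contain first-touch page faults.
    std::memset(a, 0, 8 * planes * sizeof(float));
    std::memset(b, 0, 8 * planes * sizeof(float));
    std::memset(c, 0, 8 * sizeof(float));
    std::memset(d, 0, 8 * sizeof(float));

    // Untimed warm-up call, so CPU frequency ramp-up also happens outside the timing.
    ScaleAndAddBias(a, b, c, d, planes);

    auto begin = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep) {  // repeat to reduce timer noise
        ScaleAndAddBias(a, b, c, d, planes);
    }
    auto end = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()
              << " us for 100 timed calls\n";

    delete[] a; delete[] b; delete[] c; delete[] d;
}

If the t1/t2 gap disappears once the buffers are pre-faulted like this, the saving in B.cpp most likely comes from page faults and related warm-up happening outside the timed region, rather than from any cache-locality benefit of the unrelated buffer.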
