
I only know the basic ideas of aligned memory allocation, but I haven't cared much about alignment issues because I'm not an assembly programmer and have no experience with MMX/SIMD. I also tend to think of this as one of the classic premature optimizations.

These days people are talking more and more about cache hits, cache coherence, optimizing for size, etc. Some source code even allocates memory explicitly aligned to CPU cache lines.

Frankly, I don't know the cache line size of my i7 CPU. I know there is no harm in aligning to a larger size, but will it really pay off without SIMD?

Let's say there are 100,000 items of 100-byte data in a program, and accessing this data is the most intensive work the program does.

If we change the data structure and make each 100-byte item aligned to a 16-byte boundary, is it possible to get a noticeable performance gain? 10%? 5%?
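
To make it concrete, what I have in mind is roughly this (just a sketch; "Item" is a made-up name, and I'm assuming C++11 alignas is available):

    #include <vector>

    // Sketch only: force each ~100-byte record onto a 16-byte boundary.
    // The compiler rounds the size up to a multiple of 16 (112 bytes here).
    struct alignas(16) Item {
        char payload[100];
    };

    static_assert(alignof(Item) == 16, "each Item starts on a 16-byte boundary");
    static_assert(sizeof(Item) == 112, "100 bytes padded up to a multiple of 16");

    int main() {
        std::vector<Item> items(100000);   // the 100,000 items from the question
        return 0;
    }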

9dan
  • Speaking of premature optimization, did you know that good algorithms can often give hundreds or thousands of percent speed increase for larger data sets (and even more for even larger ones)? ;) Details like how well the program plays with the cache are on the list for high-performance computing, but for most applications out there, it will never matter. – Jan 05 '11 at 15:05
  • I believe 64 bytes is a common cache-line size, not 16 bytes. – edA-qa mort-ora-y Jan 05 '11 at 18:15
  • I got a tenfold improvement out of an algorithm once by cache-aligning and prefetching its memory accesses. – Crashworks Mar 22 '12 at 21:01

4 Answers


This is one of my favorite recent blog posts about cache effects: http://igoro.com/archive/gallery-of-processor-cache-effects/
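
To give a flavor of its first example (my own rough sketch, not code from the post): touching every element of a large array and touching only every 16th element take roughly the same time, because both loops pull in the same cache lines and the runtime is dominated by memory, not arithmetic.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<int> arr(64 * 1024 * 1024);   // ~256 MB of ints

        auto time_loop = [&](std::size_t step) -> double {
            auto start = std::chrono::steady_clock::now();
            for (std::size_t i = 0; i < arr.size(); i += step)
                arr[i] *= 3;                       // one write per touched element
            auto end = std::chrono::steady_clock::now();
            return std::chrono::duration<double, std::milli>(end - start).count();
        };

        // Step 1 does 16x the arithmetic of step 16, yet on typical hardware the
        // two times are surprisingly close: both walk the same set of cache lines.
        std::printf("step 1:  %.1f ms\n", time_loop(1));
        std::printf("step 16: %.1f ms\n", time_loop(16));
        return 0;
    }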

jcopenha
  • Best explanation of the processor cache I have ever read. The effects of the cache really are dramatic. I'm scared I was too naive.. – 9dan Jan 05 '11 at 14:53

Cache optimization pays off even for a single-threaded application. But cache optimization isn't necessarily about aligning data at the start of a cache line, as there are several factors to take into consideration. So the way to go is:

  • Do you meet your performance requirements? If yes, why spend time optimizing? Optimizing for the sake of optimizing rarely pays off.

  • Measure where your bottleneck is. If you suspect cache problems, use a tool that reports cache misses so you get an idea of how much you could win.

At the highest level, the goal of cache optimization is to fill your cache with interesting data while keeping uninteresting data out of it (one way to do that is sketched below). If you are doing multithreaded programming, preventing interference between threads is also important. You also have to watch out for behavior specific to particular cache implementations, such as resonance effects, which can reduce the effective cache size of a cache that is not fully associative.
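
One common way to do that (a sketch of my own, not tied to your actual layout; the names are invented) is to split each record into the few "hot" fields that the inner loop actually reads and the "cold" rest, so every cache line brought in is full of useful data:

    #include <cstdint>
    #include <vector>

    struct HotPart {                  // 8 bytes: eight fit in a 64-byte cache line
        float         key;
        std::uint32_t cold_index;     // link back to the rarely touched data
    };

    struct ColdPart {                 // the rest of the ~100 bytes, touched only on a match
        char details[92];
    };

    struct Database {
        std::vector<HotPart>  hot;    // scanned by the hot loop
        std::vector<ColdPart> cold;   // fetched only when needed
    };

    // The hot loop streams through a compact array instead of striding over
    // 100-byte records, so the cache holds mostly "interesting" data.
    float sum_keys(const Database& db) {
        float s = 0.0f;
        for (const HotPart& h : db.hot) s += h.key;
        return s;
    }

    int main() {
        Database db;
        db.hot.resize(100000);
        db.cold.resize(100000);
        return static_cast<int>(sum_keys(db));   // keep the call observable
    }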

AProgrammer
  • So apparently for reads, and probably for writes, memory allocation aligned on a cache line is not an important issue, is it? – 9dan Jan 05 '11 at 16:32
  • If your data is read only, what is important is that data accessed together stays in the cache as far as possible. The line size of the i7 is 64 bytes (see http://www.agner.org/optimize/microarchitecture.pdf), so one of your data items, correctly aligned, will span 2 cache lines, while if it isn't aligned it will take 3 cache lines. So it could help (did I write that measuring is the way to go when you want to optimize?) – AProgrammer Jan 05 '11 at 16:47

It depends on your system. Try it, run some benchmarks, and find out.
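
For example, a rough sketch of the kind of benchmark I mean (the struct names are made up; it just times the same traversal over a packed layout and a 16-byte-aligned layout of 100-byte records on whatever machine you care about):

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Two layouts of the same 100-byte payload: naturally packed (100 bytes)
    // versus padded/aligned to a 16-byte boundary (112 bytes per record).
    struct Packed              { char payload[100]; };
    struct alignas(16) Aligned { char payload[100]; };

    template <typename T>
    double traverse_ms(const std::vector<T>& items) {
        auto start = std::chrono::steady_clock::now();
        long sum = 0;
        for (const T& it : items) sum += it.payload[0];    // touch every record
        auto end = std::chrono::steady_clock::now();
        std::fprintf(stderr, "checksum %ld\n", sum);       // keep the loop observable
        return std::chrono::duration<double, std::milli>(end - start).count();
    }

    int main() {
        const std::size_t n = 100000;                      // as in the question
        std::vector<Packed>  packed(n);
        std::vector<Aligned> aligned(n);
        std::printf("packed : %.2f ms\n", traverse_ms(packed));
        std::printf("aligned: %.2f ms\n", traverse_ms(aligned));
        return 0;
    }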

OrangeDog
  • Then it really is premature optimization. And how can one do it without a reliable CPU detection function and a list of CPU cache information? Hmm.. Am I worrying too much about nothing? – 9dan Jan 05 '11 at 14:29
  • All optimisation is premature until you have actually tested what is being slow. – OrangeDog Jan 05 '11 at 14:32
  • @9dan - You don't need those things, just a clock. – OrangeDog Jan 05 '11 at 14:33
  • @OrangeDog I mean, because benchmark results will vary by CPU, I can't apply cache-aware optimization without a CPU detection function. – 9dan Jan 05 '11 at 15:01
  • @9dan - That's what makefiles and #ifdef are for. However, these sorts of optimisations are unlikely to significantly reduce performance: they'll either improve it or there'll be little effect. – OrangeDog Jan 05 '11 at 15:14
  • @9dan Worrying about false sharing ahead of time is not premature if you need multiple cores to handle the processing. If you have unintended sharing, your performance will be worse than just using a single processor. For any system where multiple cores have to work on the same data, I would say proper data design is not an optimization but a requirement. – edA-qa mort-ora-y Jan 05 '11 at 18:20

Most of the discussions on cache line alignment deal with high-performance computing across many threads, where the goal is to keep scalability as close to linear as possible. In those discussions, the reason for cache line alignment is to prevent a write to one variable from invalidating the cache line that also contains another variable used by a different thread.

So, unless you are trying to write code that will scale to a very high number of processor cores, cache line alignment probably won't matter much to you. But again, test it and see.
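
To illustrate what that looks like (a sketch, assuming 64-byte cache lines, which is typical for the i7 mentioned in the question; the names are mine): give each thread's hot, frequently written data its own cache line, so one thread's writes don't keep invalidating a line another thread is using.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Each worker bumps only its own counter. Without the alignas, neighbouring
    // counters would share a 64-byte cache line, and every write by one thread
    // would invalidate that line for all the others (false sharing).
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
        // alignas(64) rounds the struct up to 64 bytes: one counter per line.
    };

    int main() {
        constexpr int kThreads = 4;
        static PaddedCounter counters[kThreads];   // each on its own cache line

        std::vector<std::thread> workers;
        for (int t = 0; t < kThreads; ++t)
            workers.emplace_back([t] {
                for (int i = 0; i < 10000000; ++i)
                    counters[t].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& w : workers) w.join();

        std::printf("counter 0 = %ld\n", counters[0].value.load());
        return 0;
    }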

diverscuba23
  • I think every answer taught its own lesson, but this answer may be the rule of thumb (despite having the fewest votes), so I accepted it. Thank you. – 9dan Jan 05 '11 at 16:06
  • Two threads are enough for false sharing to rear its ugly head. (But with a data size of 100 bytes, I doubt that false sharing is a problem for the OP.) – AProgrammer Jan 05 '11 at 16:09
  • @AProgrammer good point about writes. I've only been concerned about reads. – 9dan Jan 05 '11 at 16:30
  • Yes, false sharing can destroy a program. Alignment can also be used to optimize sharing: variables that will always be dirty and are needed by many threads can all be packed into a single cache line, so only one cache line needs to be updated. – edA-qa mort-ora-y Jan 05 '11 at 18:17
  • I disagree; even for a single-threaded application with lots of memory accesses, alignment can make a significant difference in performance. – Ruud Apr 02 '12 at 18:57