
I'm implementing an algorithm in C/C++ to process some vectors, and I thought it could be a good idea to make it parallel since I'm working with a multicore CPU. I have some experience with GPGPU, where bad memory access patterns can ruin the entire performance. Do I need to consider any special access layout between the cores on the CPU as well?

Thanks

John Vint
Caian
    Short answer: Yes. Even without multi-threading, CPUs can have some very nasty performance gotchas. Examples: [this](http://stackoverflow.com/q/8547778/922184) and [this](http://stackoverflow.com/q/7905760/922184). Multi-threading only makes things more complicated. – Mysticial Feb 05 '12 at 00:11
  • If you didn't use GPGPUs for arithmetically intense computations (which is what they're designed for), then you may have had problems with memory access limits. However, with respect to multi-threading, you will be far more concerned about context switching and synchronization (generally the biggest issues in concurrent programming) than about memory access latency. – Kiril Feb 05 '12 at 00:14
  • @Lirik: it all depends on the size of the work involved. If the vectors are short and there's little work, context switching and synchronization could be an issue. If the vectors are long and the work substantial, memory organization, cache manipulation and code tweaking are EVERYTHING. – Olof Forshell Feb 06 '12 at 13:11

2 Answers


There are a number of memory-related problems you can run into with a multiprocessor setup, and some of them can slow an application to a crawl.

You need to be roughly aware of the cache line size on your box and attempt 2 things:

  1. Limit the number of data cache lines (particularly cache lines you write to) accessed in close time sequence by a single thread. I.e., avoid "dirtying" more cache lines than you must.
  2. Avoid like the plague having two separate threads "simultaneously" access the same data cache line, with either one writing (see the sketch below).

(The above two rules also apply to data pages, if you're dealing with large data structures that must be paged.)
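Rule 2 is the classic "false sharing" trap. A minimal sketch of the fix, assuming a 64-byte line (typical for x86, but verify on your CPU - C++17 exposes an estimate as `std::hardware_destructive_interference_size`):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t kCacheLine = 64;  // assumed line size; verify for your CPU

// Bad: both counters live on one cache line, so two threads updating
// "independent" data still bounce the line between cores.
struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Better: align each counter to its own cache line.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

int main() {
    PaddedCounters c;
    std::thread t1([&] { for (int i = 0; i < 1000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 1000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}
```

With `PaddedCounters`, each thread dirties only its own line; with `SharedLine`, the line ping-pongs between the two cores on every update.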

Where possible, set up separate working data structures (especially heap) for each thread, rather than sharing the data. Especially beware of having a common counter that all threads update, and (obviously) avoid locks and semaphores except at critical junctures where you absolutely need to synchronize threads.
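As a hedged illustration of that last point (the function name and the partitioning scheme are mine, not the answer's): give each thread a private accumulator and combine the results only once at the end, instead of having all threads hammer a shared counter:

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Sum a large vector with one private partial sum per thread;
// threads touch no shared state until the final combine.
long parallel_sum(const std::vector<int>& data, unsigned nthreads) {
    std::vector<long> partial(nthreads, 0);          // one slot per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = (t + 1 == nthreads) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            long local = 0;                          // hot value stays in a register
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;                      // single write at the very end
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Each thread writes its `partial[t]` slot exactly once, after its loop, so even though the slots sit on adjacent lines the cross-thread cache traffic is negligible.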

Hot Licks
  • @Hot_Licks, minor tweak: E.g. avoid "touching" (reading) or "dirtying" (writing) more cache lines than you must. – Krazy Glew Apr 28 '12 at 14:28
  • @KrazyGlew - "Limit the number of data cache lines (particularly cache lines you write to) accessed in close time sequence by a single thread." – Hot Licks Apr 30 '12 at 01:37

@Hot_Licks: Actually, if the threads are two hyperthreads running on the same core, then there is no problem with both threads accessing the same cache line, whether reading or writing. Clean lines are shared cost-free between hardware threads on the same Intel CPU. Even dirty lines are shared very cheaply - although you can get MOMC nukes (memory-order machine clears) if one guy is reading the data at the same time the other is writing. (Oddly enough, no penalty if two such hardware/hyperthreads are writing at the same time.)

With AMD's only "threaded" CPU, Bulldozer, I think that write sharing is even less costly.

But this applies only to hardware threads, e.g. Intel hyperthreads or logical processors, running on the same physical processor. If they are running on different physical processors, no win. Since most software threading packages migrate threads arbitrarily, your rule is not so bad.
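If you do want to exploit this, you have to pin the threads yourself; the scheduler makes no promise of sibling placement. A Linux-specific sketch using `pthread_setaffinity_np` (the assumption that logical CPUs 0 and 1 are hyperthread siblings is mine - check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list on your machine):

```cpp
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one logical CPU. Returns 0 on success.
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Usage (assumed topology: logical CPUs 0 and 1 are hyperthread
// siblings on one physical core -- verify before relying on it):
//   thread A calls pin_to_cpu(0);
//   thread B calls pin_to_cpu(1);
// so lines they share stay in that core's private caches.
```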

Nevertheless, you still want to minimize (a) the lines accessed by a single thread, and (b) the total set of lines accessed across all threads, even lines that are not shared between threads, since caches - MLC, LLC - are a limited resource. But you are right - once you are missing the cache...
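One hedged sketch of keeping that footprint down: block (tile) the work so each pass reuses lines that are still resident, rather than streaming the whole vector twice. The 256 KiB block size is an assumption - size it to your core's share of cache:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockBytes = 256 * 1024;  // assumed per-core cache budget

// Two logical passes over the data, fused at cache-block granularity:
// each block is pulled into cache once and reused, instead of the
// second pass missing on every line the first pass already evicted.
void process_blocked(std::vector<float>& v) {
    const std::size_t block = kBlockBytes / sizeof(float);
    for (std::size_t base = 0; base < v.size(); base += block) {
        const std::size_t end = std::min(v.size(), base + block);
        for (std::size_t i = base; i < end; ++i) v[i] *= 2.0f;  // pass 1: lines now cached
        for (std::size_t i = base; i < end; ++i) v[i] += 1.0f;  // pass 2: hits in cache
    }
}
```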

Krazy Glew