There are different aspects to multi-threading, each requiring a different approach.
On a web server, for example, thread pools are widely used since they are supposedly "good for" performance. Such pools may contain hundreds of threads waiting to be put to work. Using that many threads causes the scheduler to work overtime, which is detrimental to performance but can't be avoided on Linux systems. On Windows the method of choice is the IOCP mechanism, which recommends using no more threads than the number of cores installed. It makes an application (I/O-completion) event driven, which means that no cycles are wasted on polling, and the few threads involved reduce scheduler work to a minimum.
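To make the IOCP approach concrete, here is a minimal sketch of such a worker pool in C against the Win32 API. The handle_completion() call is a hypothetical placeholder for the application's own logic, not a real API:

```c
#include <windows.h>

/* Worker: blocks on the completion port instead of polling. */
static DWORD WINAPI worker(LPVOID param)
{
    HANDLE iocp = (HANDLE)param;
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED *ov;

    /* The kernel wakes at most the configured number of threads,
     * so the scheduler has very little to do. */
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
        /* handle_completion(key, ov, bytes);  <- your logic here */
    }
    return 0;
}

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    DWORD n = si.dwNumberOfProcessors;   /* one thread per core */

    /* Create the port, capping concurrency at the core count. */
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, n);
    for (DWORD i = 0; i < n; i++)
        CreateThread(NULL, 0, worker, iocp, 0, NULL);

    /* Sockets/files would be associated with the port via further
     * CreateIoCompletionPort() calls before issuing overlapped I/O. */
    Sleep(INFINITE);
    return 0;
}
```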
If the object is to implement functionality that is scalable (more cores <=> higher performance), then the main issue will be memory bus saturation. The saturation occurs through code fetching, data reading, and data writing. Incorrectly implemented code will run slower with two threads than with one. The only way around this is to actively reduce the memory bus work by:
- tailoring the code to a minimal memory footprint (so it fits in the code cache) that doesn't call other functions or jump all over the place.
- tailoring memory reads and writes to a minimum size.
- informing the prefetch mechanism of upcoming RAM reads (see the sketch after this list).
- tailoring the work such that the ratio of work performed inside the core's own caches (L1 & L2) is as great as possible in relation to the work outside them (L3 & RAM).
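As an illustration of the prefetch point, here is a minimal sketch using GCC/Clang's __builtin_prefetch(). The indexed (irregular) access pattern is deliberate, since the hardware prefetcher already handles plain sequential reads well, and the look-ahead distance of 8 elements is an assumption that would have to be tuned on real hardware:

```c
#include <stddef.h>

/* Sum a[] through an index table. Without the hint, every a[idx[i]]
 * read is unpredictable to the hardware prefetcher. */
double gather_sum(const double *a, const int *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            /* Announce an upcoming read (rw = 0) with low temporal
             * locality (the element is used only once). */
            __builtin_prefetch(&a[idx[i + 8]], 0, 0);
        sum += a[idx[i]];
    }
    return sum;
}
```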
To put this another way: fit the applicable code and data chunks into as few cache lines (at 64 bytes each) as possible, because ultimately this is what decides scalability. If the cache/memory system is capable of x cache-line operations per second, your code will run faster if it requires five cache lines per unit of work (=> x/5) rather than eleven (x/11) or fifty-two (x/52).
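One concrete consequence of the 64-byte line is false sharing: two threads writing to logically independent variables that happen to share a line will invalidate each other's caches on every write. A minimal sketch, assuming C11 and 64-byte lines:

```c
#include <stdalign.h>
#include <stdint.h>

/* BAD: eight 8-byte counters pack into a single 64-byte line, so an
 * increment by one core evicts the line from every other core. */
struct counters_shared {
    uint64_t count[8];
};

/* BETTER: pad each counter to a full line; writes stay core-local. */
struct counter_padded {
    alignas(64) uint64_t count;   /* struct size rounds up to 64 */
};
struct counters_private {
    struct counter_padded per_thread[8];
};
```

The padded layout spends eight times the memory, which is exactly the kind of trade-off between footprint and line ownership that has to be measured case by case.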
Achieving this is not trivial, since it requires a more or less unique solution every time. Some compilers do a good job of ordering instructions to take advantage of the host processor's pipelining, but an ordering that is good for a single core is not necessarily good for multiple cores.
An efficient implementation of scalable code will not necessarily be a pretty one. Recommended coding techniques and styles may, in the end, hinder the code's execution.
My advice is to test how this works by writing a simple multi-threaded application in a low-level language (such as C) that can be switched between single- and multi-threaded mode, and then profiling the code in each mode. You will need to analyze the code at the instruction level. Then experiment with different (C) code constructs, data organization, etc. You may have to think outside the box and rethink the algorithm to make it more cache-friendly. A sketch of such a harness follows.
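Here is a minimal sketch of such a harness using POSIX threads; the toy workload, the array size, and the 64-thread ceiling are all placeholders to be replaced with your own:

```c
/* Build: gcc -O2 -pthread bench.c -o bench
 * Run:   ./bench 1   vs.   ./bench 4   (thread count on the command line) */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 22)                  /* 4M doubles = 32 MB of data */

static double data[N];

struct job { size_t lo, hi; double sum; };

static void *work(void *arg)
{
    struct job *j = arg;
    double s = 0.0;
    for (size_t i = j->lo; i < j->hi; i++)
        s += data[i] * data[i];       /* toy workload */
    j->sum = s;
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 1;
    if (nthreads < 1) nthreads = 1;
    if (nthreads > 64) nthreads = 64;

    for (size_t i = 0; i < N; i++)
        data[i] = (double)i;

    pthread_t tid[64];
    struct job jobs[64];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Split the same total work over 1..nthreads threads. */
    size_t chunk = N / (size_t)nthreads;
    for (int t = 0; t < nthreads; t++) {
        jobs[t].lo = (size_t)t * chunk;
        jobs[t].hi = (t == nthreads - 1) ? N : (size_t)(t + 1) * chunk;
        pthread_create(&tid[t], NULL, work, &jobs[t]);
    }

    double total = 0.0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        total += jobs[t].sum;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%d thread(s): %.1f ms (sum = %g)\n", nthreads, ms, total);
    return 0;
}
```

Running both modes under a profiler (for example, perf stat -e cache-misses on Linux) shows whether the extra threads actually pay off or merely saturate the memory bus.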
The first time around will require a lot of work. You will not learn what works for every multi-threaded solution, but you will perhaps get an inkling of what not to do and what indications to look for when analyzing profiled code.