
I am a bit confused about what memory looks like in a dual CPU machine from the perspective of a C/C++ program running on Linux.

Case 1 (understood)

With one quad-core HT CPU and 32GB RAM, I can, in theory, write a single-process application using up to 8 threads and up to 32GB of RAM without going into swap or overloading the threading facilities - I'm ignoring the OS and other processes here for simplicity.

Case 2 (confusion)

What happens with a dual quad-core HT CPU with 64GB RAM set up?

Development-wise, do you need to write the application as two processes (8 threads, 32GB each) that communicate, or can you write it as one process (16 threads, the full 64GB)?

If the answer is the former, what are some efficient modern strategies to utilize the entire hardware? shm? IPC? Also, how do you direct Linux to use a different CPU for each process?

kfmfe04
  • There's a bunch of ways to do this. One way is to run two processes and pin each to its own CPU, then use IPC. The other way is to just treat it as a single shared-memory machine and either eat the NUMA overhead, or play tricks with memory allocation and affinity to squeeze out every last drop (a rough sketch of the first approach follows after these comments). – Mysticial Mar 22 '13 at 04:53
  • From an application point of view, a dual quad-core machine is an octo-core (i.e. 16 processor threads) machine. However, the timings and delays might be different. – Basile Starynkevitch Mar 22 '13 at 06:12
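For what it's worth, here is a rough sketch of the "two pinned processes sharing memory" approach mentioned above. The shm name, sizes, and CPU numbers are illustrative assumptions, not a recommendation for any particular machine.

```c
/* Rough sketch of "two pinned processes + POSIX shared memory".
 * SHM_NAME, sizes, and the CPU numbers are illustrative only.
 * Compile with: gcc two_proc.c   (add -lrt on older glibc) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SHM_NAME "/demo_shm"        /* hypothetical name */
#define SHM_SIZE (1UL << 20)        /* 1 MB shared region */

/* Pin the calling *process* to one logical CPU. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}

int main(void)
{
    /* A shared-memory region both processes can map. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    ftruncate(fd, SHM_SIZE);
    char *shared = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    for (int i = 0; i < 2; i++) {
        if (fork() == 0) {          /* child i */
            pin_to_cpu(i);          /* pick CPUs on different sockets in real code
                                       (check `lscpu` / `numactl --hardware`) */
            shared[i] = 'A' + i;    /* trivial "work" visible to the parent */
            _exit(0);
        }
    }
    while (wait(NULL) > 0) ;        /* reap both children */

    printf("children wrote: %c %c\n", shared[0], shared[1]);
    shm_unlink(SHM_NAME);
    return 0;
}
```

From the shell, `taskset -c 0 ./prog` achieves the same per-process pinning without any code changes.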

2 Answers


From the application's viewpoint, the number of physical CPUs (dies) doesn't matter; only the number of virtual processors does. That count includes all cores on all processors, doubled wherever hyperthreading is enabled on a core. Threads are scheduled onto them in the same way either way: it doesn't matter whether the cores are all on one die or spread across multiple dies.

In general, the best way to handle these things is to not. Don't worry about what's running on which core. Just spawn an appropriate number of threads for your application, (up to a theoretical maximum equal to the total number of cores in the system), and let the OS deal with the scheduling.
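As a minimal sketch of that advice (thread count taken from sysconf; nothing here is pinned or NUMA-aware on purpose):

```c
/* Minimal sketch: one worker thread per logical CPU, scheduling left to the OS.
 * Compile with: gcc spawn.c -pthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    printf("worker %ld running\n", (long)arg);
    return NULL;            /* real work would go here */
}

int main(void)
{
    /* Logical CPUs currently online: cores x HT siblings, across all sockets. */
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpus < 1)
        ncpus = 1;

    pthread_t tid[ncpus];
    for (long i = 0; i < ncpus; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncpus; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```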

The memory is shared amongst all cores in the system, of course. But again, it's up to the OS to handle allocation of physical memory. Very few applications really need to worry about how much memory they use, or about divvying up that memory between threads. Let the OS handle that.

Jonathon Reinhart
  • ok - that's the threading part - what about the memory? I can access the full 64GB in one process (16 threads)? Both CPUs can access the full RAM? – kfmfe04 Mar 22 '13 at 04:56
  • @kfmfe04 Yes, both CPUs can access the full RAM. The performance won't be even, though (NUMA). But they can access each other's memory. – Mysticial Mar 22 '13 at 04:57
  • @Mysticial - interesting: then I can just do the simple model of one process using 64GB (16 threads)... I didn't know this: I thought with a dual CPU, each CPU could only access its own separate bank of memory. – kfmfe04 Mar 22 '13 at 04:59
  • @kfmfe04 You are concerning yourself with details that, as an application programmer, you should very rarely ever have to worry about. Those are all system-level details. – Jonathon Reinhart Mar 22 '13 at 04:59
  • @JonathonReinhart in general, I agree with you - but I am working on a very specific (scientific simulation) application that is very memory/hardware dependent, so I need to think about these things. Otherwise, I would be using Java instead of C/C++... – kfmfe04 Mar 22 '13 at 05:01
  • @kfmfe04 Yeah. They can access each other's memory. But they can access their own memory faster. For dual-socket systems, the latency difference between local memory and the other CPU's memory is about 30% - 50%. In the large (quad-socket) machines, that difference increases drastically. So at *some* point it becomes so bad that the best way is to switch to MPI-based programming methods. – Mysticial Mar 22 '13 at 05:01
  • @Mysticial Do modern OSs take NUMA into account when scheduling processes on CPUs? In other words, does the OS know that most of the physical pages for process 100 live in one half of memory, so it should schedule that process (thread) on a core that is "connected" to that memory? I've never really concerned myself with the implications of NUMA. – Jonathon Reinhart Mar 22 '13 at 05:01
  • @Mysticial wow - that's really interesting! I'd be interested in a detailed answer from that POV - I suspected you might have some experience in that area from your PI calcs... – kfmfe04 Mar 22 '13 at 05:02
  • @kfmfe04: No. With dual CPU both CPUs can access all memory. Actually, that quad core CPU you're talking about is a four CPU machine. Not a single CPU machine. A "core" is a CPU. – slebetman Mar 22 '13 at 05:02
  • @JonathonReinhart They do for memory allocation to some extent. But not necessarily for thread affinities. Typically the worst case happens and if you don't intervene (and pin your threads and your memory allocations), you will find that your threads and your memory are all in the wrong places. For a dual-socket machine with low inter-socket latencies, it's not noticeable. But for bigger machines (> 4 sockets) it's really bad. – Mysticial Mar 22 '13 at 05:04
  • Btw, the question that brought me to Stackoverflow was a [question on dual-socket NUMA](http://stackoverflow.com/questions/7259363/measuring-numa-non-uniform-memory-access-no-observable-asymmetry-why). :) – Mysticial Mar 22 '13 at 05:05
  • @Mysticial I was aware of processor affinity, but how does one determine (as a usermode process) which range of physical memory their heap, for example, will be allocated in? I mean `brk` is `brk`, right? – Jonathon Reinhart Mar 22 '13 at 05:06
  • @JonathonReinhart The numactl API in Linux lets you control it to some extent. You can either "suggest" an affinity, in which case it will *try* to allocate on a specific node, or you can "force" it onto a node, in which case it will still fail if it can't. I'm not sure if there's a way to determine where a specific address is, though (see the libnuma sketch after these comments). – Mysticial Mar 22 '13 at 05:08
  • @Mysticial Interesting stuff. This is getting a bit off-topic though, so I'll let it go and (following my own advice) do my own research :-P Thanks for opening my eyes to this, however! – Jonathon Reinhart Mar 22 '13 at 05:10
  • @JonathonReinhart Yeah, it's a big can of worms. For all practical purposes, a dual-socket machine is a normal machine. 4 and up and you need to really manage the NUMA. The APIs are generally crap and poorly documented. Most programs that run on these large machines are MPI based. – Mysticial Mar 22 '13 at 05:12
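As a rough illustration of the "pin and place" strategy discussed in these comments, here is a libnuma sketch. Node 0 and the 64 MB size are arbitrary assumptions, and the exact allocation policy is worth checking against the numa(3) man pages.

```c
/* Rough sketch of "pin the thread and place its memory on the same node"
 * with libnuma. Node 0 and the 64 MB size are arbitrary. Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support; nothing to place\n");
        return 1;
    }

    int node = 0;                       /* hypothetical target node */
    size_t sz = 64UL * 1024 * 1024;     /* 64 MB working set */

    numa_run_on_node(node);             /* keep this thread on that node's CPUs */
    void *buf = numa_alloc_onnode(sz, node);   /* ask for memory on the same node */
    if (!buf)
        return 1;

    memset(buf, 0, sz);                 /* touch the pages so they really get placed */
    numa_free(buf, sz);
    return 0;
}
```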

The memory model has **nothing** to do with the number of cores per se; rather, it has to do with the architecture employed on multi-core computers. Most mainstream computers use the symmetric multiprocessing (SMP) model, wherein a single OS controls all the CPUs, and programs running on those CPUs have access to all the available memory. Each CPU does have private memory (cache), but the RAM is all shared. So on a 64-bit machine it makes zilch difference whether you write one process or two, as far as memory usage implications are concerned. Programming-wise, you would be better off using a single process.

As others pointed out, you do need to worry about thread affinities and such, but that has more to do with efficient use of CPU resources and little to do with RAM usage. There are some implications for cache usage, though.
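For the thread-affinity part, a small Linux-specific sketch using pthread_setaffinity_np (the CPU number is arbitrary):

```c
/* Small Linux-specific sketch: pin the calling thread to one logical CPU.
 * Compile with: gcc pin_thread.c -pthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                   /* arbitrary choice: logical CPU 2 */

    /* Restrict the current thread to that CPU. */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }
    printf("now pinned to CPU 2\n");
    return 0;
}
```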

Contrast this with computers using other memory models, like NUMA (Non-Uniform Memory Access), where each CPU has its own block of memory and communication across CPUs requires some arbiter in between. On these computers you WOULD NEED to worry about where to place your threads, memory-wise.
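If you do end up on such a machine, here is a small libnuma sketch (link with -lnuma) for inspecting the topology before deciding where to place anything:

```c
/* Small sketch: inspect the NUMA topology before deciding where to place
 * threads and memory. Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("kernel reports no NUMA support; treat memory as flat\n");
        return 0;
    }

    int nodes = numa_num_configured_nodes();
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%d NUMA node(s), %ld logical CPU(s)\n", nodes, ncpus);

    /* Which node does each logical CPU belong to? */
    for (long cpu = 0; cpu < ncpus; cpu++)
        printf("cpu %ld -> node %d\n", cpu, numa_node_of_cpu((int)cpu));
    return 0;
}
```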

Amit