
I want to implement parallel processing in a yasm program using the POSIX threads (pthreads) library.

Code

Here is the most important part of my program.

section .data
pThreadID1      dq      0
pThreadID2      dq      0
MAX:            dq      100000000
value:          dd      0

section .bss
extern          pthread_create
extern          pthread_join

section .text

global main
main:
    ;create the first thread with id = pThreadID1
    mov     rdi, pThreadID1         ; 1st arg: pthread_t *thread
    mov     rsi, NULL               ; 2nd arg: const pthread_attr_t *attr
    mov     rdx, thread1            ; 3rd arg: thread start routine
    mov     rcx, NULL               ; 4th arg: argument passed to the start routine
    call    pthread_create
    ;join the 1st thread
    mov     rdi, qword [pThreadID1] ; 1st arg: pthread_t thread (passed by value)
    mov     rsi, NULL               ; 2nd arg: void **retval (not used here)
    call    pthread_join

    ;create the second thread with id = pThreadID2
    mov     rdi, pThreadID2 
    mov     rsi, NULL
    mov     rdx, thread2
    mov     rcx, NULL
    call    pthread_create
    ;join the 2nd thread
    mov     rdi, qword [pThreadID2]
    mov     rsi, NULL
    call    pthread_join
    ;print value block
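    ; (a minimal sketch of the elided print block, assuming printf is used;
    ;  the fmt label and format string are illustrative, not the actual code)
    ;   lea     rdi, [rel fmt]          ; 1st arg: e.g. fmt: db "value: +%d", 10, 0
    ;   mov     esi, dword [value]      ; 2nd arg: the final counter
    ;   xor     eax, eax                ; variadic call: no vector registers used
    ;   call    printf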

where thread1 contains a loop in which value is incremented by one MAX/2 times:

global thread1
thread1:
    mov     rcx, qword [MAX]
    shr     rcx, 1

    thread1.loop:
        mov     eax, dword [value]
        inc     eax
        mov     dword [value], eax
        loop    thread1.loop

    ret

and thread2 is similar. NOTE: thread1 and thread2 share the variable value.
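
For completeness, here is a minimal sketch of thread2, assuming it simply mirrors thread1 and performs the remaining MAX/2 increments of the same shared value:

global thread2
thread2:
    mov     rcx, qword [MAX]
    shr     rcx, 1

    thread2.loop:
        mov     eax, dword [value]
        inc     eax
        mov     dword [value], eax
        loop    thread2.loop

    ret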

Result

I assemble and link the above program as follows:

yasm -g dwarf2 -f elf64 Parallel.asm -l Parallel.lst
gcc -g Parallel.o -lpthread -o Parallel

Then I use the time command to measure the elapsed execution time:

time ./Parallel

And I get

value: +100000000

real    0m0.482s
user    0m0.472s
sys 0m0.000s

Problem

OK. In the program above I create one thread, wait until it has finished, and only then create the second one. Not the best "threading", is it? So I change the order in the program as follows:

;create thread1
;create thread2
;join   thread1
;join   thread2
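
Concretely, main becomes the following (only the order of the create/join calls changes; everything else is as in the code above):

    ;create the first thread with id = pThreadID1
    mov     rdi, pThreadID1
    mov     rsi, NULL
    mov     rdx, thread1
    mov     rcx, NULL
    call    pthread_create

    ;create the second thread with id = pThreadID2
    mov     rdi, pThreadID2
    mov     rsi, NULL
    mov     rdx, thread2
    mov     rcx, NULL
    call    pthread_create

    ;join the 1st thread
    mov     rdi, qword [pThreadID1]
    mov     rsi, NULL
    call    pthread_join

    ;join the 2nd thread
    mov     rdi, qword [pThreadID2]
    mov     rsi, NULL
    call    pthread_join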

I expect the elapsed time to be smaller in this case, but I get

value: +48634696

real    0m2.403s
user    0m4.772s
sys 0m0.000s

I understand why value is not equal to MAX, but what I do not understand is why the elapsed time is significantly greater in this case. Am I missing something?

EDIT

I decided to eliminate the overlap between thread1 and thread2 by giving each thread its own variable and then just adding the results. In this case the "parallel" order gives a smaller elapsed time than the previous result, but it is still greater than the "series" order.

Code

Only the changes are shown.

Data

Now there are two variables, one for each thread.

section .data
value1:         dd      0
value2:         dd      0

Threads

Each thread is responsible for incrementing its own value.

global thread1
thread1:
    mov     rcx, qword [MAX]
    shr     rcx, 1

    thread1.loop:
        mov     eax, dword [value1]
        inc     eax
        mov     dword [value1], eax
        loop    thread1.loop

    ret

thread2 is similar (replace 1 by 2).

Getting final result

Assuming that the comments stand for the corresponding blocks of code from the Code section at the beginning of the question, the program is the following.

Parallel order
;create thread1
;create thread2

;join thread1
;join thread2

mov     eax, dword [value]
add     eax, dword [value1]
add     eax, dword [value2]
mov     dword [value], eax

Result

value: +100000000

Performance counter stats for './Parallel':

   3078.140527      cpu-clock (msec)                                            

   1.586070821 seconds time elapsed

Series order
;create thread1
;join thread1

;create thread2
;join thread2

mov     eax, dword [value]
add     eax, dword [value1]
add     eax, dword [value2]
mov     dword [value], eax

Result

value: +100000000

Performance counter stats for './Parallel':

    508.757321      cpu-clock (msec)                                            

   0.509709406 seconds time elapsed

UPDATE

I've drawn a simple graph that shows the elapsed time against the MAX value for the 4 different modes.

LRDPRDX
  • The performance is basically the same as if you were incrementing two non-overlapping counters in the same cache line from two threads. That situation is called *false sharing*. Look at `ocperf.py stat -e machine_clears.memory_ordering ./Parallel` (https://github.com/andikleen/pmu-tools), and see also https://stackoverflow.com/questions/45602699/what-are-the-latency-and-throughput-costs-of-producer-consumer-sharing-of-a-memo. – Peter Cordes Mar 25 '18 at 11:49
  • See also https://stackoverflow.com/questions/39393850/can-num-be-atomic-for-int-num for how CPUs maintain cache coherency with multiple cores trying to load / store to the same location: MESI RFO requests to get exclusive access. (Upvoted for actually doing an experiment yourself, and getting correct experimental data, though. You don't show the actual loop that does the incrementing, but for this it doesn't really matter how you do it, separate load / inc / store or a memory-destination add or inc are equivalent for this. :) – Peter Cordes Mar 25 '18 at 11:50
  • @PeterCordes, I have edited the post. I have added the code for the loop and some updates. – LRDPRDX Mar 25 '18 at 14:26
  • Why do `thread1` and `thread2` have different code at all? You could just run the same function in both threads (pass a pointer arg). And BTW, [avoid the slow `loop` instruction](https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently). Although even without memory-ordering machine nukes, the store/reload dep chain is as much of a bottleneck as 5-cycle per iteration `loop` on Skylake. Also, use `default rel` for RIP-relative addressing for addressing modes like `[value]`. (Again won't make a diff, but better in general.) – Peter Cordes Mar 25 '18 at 14:53
  • Is your overlap / non-overlap timing difference repeatable? Or was it maybe just chance, like ending up on the same vs. different physical cores for some of the time, if you have a CPU with hyperthreading? And BTW, compare putting your two counters at least 64 bytes apart, so they're in separate cache lines. And definitely use `ocperf.py` to look at performance counters for this; memory ordering machine clears should fully explain all the performance effects you're seeing. – Peter Cordes Mar 25 '18 at 14:56
  • It would be helpful if you'd post the actual code for the interesting variants. You posted the serial variant, but don't show `thread2`, and you don't post the code for the parallel variant. You don't show the code for the part mentioned after the **Edit** either. Details are important here, so having the exact code would be great (among other things, it would let me try to reproduce locally). – BeeOnRope Mar 27 '18 at 18:43
  • So that `add eax, dword [value1]` stuff in the "Getting Final Result" section is just in the parent thread? It's confusing because you're storing `eax` back into memory when you're done, instead of keeping it in a register where you need it. I thought this was a new loop body you were showing for a while. Have you tried multiple runs to see how reproducible the `1.59s` elapsed time is for false sharing vs. 2.4s for true sharing? I see that CPU time is nearly double that, so it wasn't that one thread finished early and the other thread then ran quickly. – Peter Cordes Mar 27 '18 at 22:33
  • @PeterCordes, Yes. the `add eax, dword [value1]` and others are in the parent thread. I do not understand what you mean by "keeping it [register?] where you need it". And what "new loop body you were showing for a while" are you talking about? Sorry for my misunderstanding. – LRDPRDX Mar 28 '18 at 06:25
  • I thought at first you were showing that code because it was a new loop body that you were testing, which did 3 loads + 1 stores instead of 1 load + 1 store. Adding up the results after both `pthread_join` complete isn't that interesting (so I wasn't expecting to see it), and it looked way overcomplicated for the way you're actually using it. First, because if you have two separate `value1` and `value2` static locations, why do you also need a `value` static location? `mov esi, [value1]` / `add esi, [value2]` / `lea rdi, [rel fmt]` / `call printf` is what I'd expect. (Or `mov edi, fmt`) – Peter Cordes Mar 28 '18 at 12:59
  • Thanks for adding the graph, that's a nice addition. Seems to confirm that being faster with false sharing than true sharing is a real thing, but I don't know why. `align 16` before `value1` would make *sure* they're both in the same cache line, and `align 64` after `value1` before `value2` would make sure they aren't (which should give you 2x the performance of Serial) – Peter Cordes Mar 28 '18 at 13:03

1 Answer


The version with two threads running at once is slower because the two cores running your code compete over the cache line that contains your counter value. The cache line has to move back and forth between the two cores, which takes tens of cycles each time, and only a few increments occur before it moves back to the other core. Compare that to the single-threaded cases, where an increment can occur once per ~5 cycles, limited by store-forwarding latency and the slow loop instruction.

This is true both for the case where the threads increment a shared value and for your other case where the counters are distinct. It applies even in the latter case because value1 and value2 are declared in contiguous memory locations and thus end up on the same cache line. Since coherency operates at cache-line granularity, this so-called "false sharing" effect behaves much like true sharing (the first case).
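
One way to check this hypothesis (a sketch only, untested against the exact program in the question) is to pad the counters so that they cannot share a cache line, for example with yasm's align directive:

section .data
align   64
value1:         dd      0
align   64              ; value2 now starts in a different 64-byte cache line
value2:         dd      0

With the counters at least 64 bytes apart, each core can keep its own line, and the two-thread version should then scale instead of thrashing.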

You mentioned that although both the true and false sharing cases were much slower than the single-threaded case, the true sharing case was still slower than the false sharing case. I had thought, without testing, that these two cases would perform equivalently, but it makes sense that they don't: while both suffer the cache-line thrashing described above, the true sharing case may additionally suffer memory-ordering machine clears, although the exact mechanism isn't clear.
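
If you want to confirm that, the machine clears can be counted directly with Andi Kleen's pmu-tools wrapper around perf (this is the command suggested in the comments above):

ocperf.py stat -e machine_clears.memory_ordering ./Parallel

If this explanation is right, the true sharing run should report far more memory-ordering machine clears than the false sharing run.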

BeeOnRope
  • I suspect that the non-overlapping test was in response to my first comment, which suggested that performance would be identical to the more common false-sharing situation with two counters in the same cache line. So yes, the two counters probably were *intentionally* in the same line. It's weird that the performance wasn't identical, though. – Peter Cordes Mar 26 '18 at 22:20
  • @PeterCordes - well, I kind of object to the idea that false sharing (same cache line, different memory location) is more common than _true sharing_ (same location), if that's what you mean. Certainly _false sharing_ is often mentioned, and has a single name that is easy to recognize, so it gets a lot of "press". It is often specifically called out because it's a case that is easy to fix, since the sharing is more coincidental than fundamental. It is often a problem in well-designed code that explicitly tries to reduce true sharing, where false sharing is a "gotcha" that depends on details... – BeeOnRope Mar 27 '18 at 18:48
  • ... - but true sharing is very common, under a variety of different names ("contention", "synchronization", "cross thread traffic", etc). It's fundamentally designed into any code that uses mutexes, shared counters, lock-free structures, etc. So now I feel like we have this confusing situation where people describe true sharing as "like false sharing except they happen to be the exact same location", which is really backwards to me: I'd expect false sharing to be described in terms of true sharing costs, but the reverse seems to happen! – BeeOnRope Mar 27 '18 at 18:50
  • So as far as I understand the OP's scenario, the original code used a shared counter (that's why it doesn't sum to `100,000,000`) and the additional case added after your comment uses distinct locations but with an unknown distance between them. It's all speculation since no code was provided for the interesting cases. I can't explain the performance difference either, but it's not _that_ big and it could entirely be some transient thing since this type of stuff often has high variance depending on external factors. – BeeOnRope Mar 27 '18 at 18:54
  • @BeeOnRope, Post edited. Please check it and say if something else should be added/changed to improve the presentation. P.S. Look at the EDIT section for the changes and at the text at the beginning for reference. – LRDPRDX Mar 27 '18 at 21:10
  • I meant with an access pattern like the OP is testing, with no synchronization and non-atomic RMW. It can actually happen with false sharing, but would be an obvious race bug with true sharing if two threads were really hammering away for long enough to measure. Thus you won't find it in a debugged program. I see what you mean about the confusion, though. – Peter Cordes Mar 27 '18 at 22:20
  • @Peter - oh yeah, right, there is no sync at all, so this makes it an unusual case of sharing (not always a bug though, certainly some interesting patterns are possible in x86 without any locked operations, but incrementing a counter from two threads in a loop is not one of them). Once you have contention, I found that for some patterns `lock` doesn't matter at all. Maybe what's happening here is that you don't get memory ordering clears in the disjoint location case, i.e. snoops are fine-grained in their address. – BeeOnRope Mar 27 '18 at 23:17
  • Guys, I have drawn a graph of Time vs. MAX for all modes. Check it in the **UPDATE** section. – LRDPRDX Mar 28 '18 at 08:15
  • @LRDPRDX - interesting. I would conclude that false sharing and true sharing actually have somewhat different costs. Intuitively this makes sense: both pay the cost of cache-line thrashing, but only the true sharing case pays the cost of additional memory ordering violations. On second thought, how that would work isn't entirely clear, since everything happens in cache line units, so I don't think the snoop would even refer to the more fine-grained location... – BeeOnRope Apr 28 '18 at 20:25