Given this piece of asm code:

(Line 1) .L6:
(Line 2) movsd -8(%rdx,%rax,8), %xmm0 
(Line 3) .L2:
(Line 4) addsd (%rcx,%rax,8), %xmm0
(Line 5) movsd %xmm0, (%rdx,%rax,8)
(Line 6) addq $1, %rax
(Line 7) cmpq %rax, %r8
(Line 8) jne .L6

1) What is the dependency on xmm0 between line 2 and line 4? Is my guess that it is Read After Write correct?

2) What about the dependency between lines 4 and 5? In line 4, xmm0 seems to be both read and written (in that order), and in line 5 it is read and then copied into the location (%rdx,%rax,8). So is it Read After Read, or Read After Write?

I'm confused because there are more than two operations happening (read, write, read), so I'm not sure which ones count when looking at data dependencies.

This is the C code:

void randomgenericfunction(double a[], double p[], long n)
{
     long i;
     p[0] = a[0];
     for (i=1; i<n; i++) {
         p[i] = p[i-1] + a[i];
     }
     return;
}

Any help will be appreciated, thanks!

Megan Darcy

2 Answers

2

Line 2 only writes xmm0, line 4 reads and then writes it, line 5 only reads it. So 2 to 4 and 4 to 5 are both read after write.

I suppose you could argue that 4 to 5 is also read after read, but that's not really a dependency, since two reads don't have any effect on each other. If line 4 were changed to only read xmm0 and not write it, it would be perfectly fine for a compiler or CPU to reorder it with line 5. So that second "dependency" isn't worth mentioning.
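A minimal sketch of that rule, assuming a hypothetical model in which each instruction is reduced to the set of registers it reads and the set it writes. The names `Insn`, `classify`, `check_listing`, and the `HAZ_*` flags are made up for illustration, and memory operands are ignored. The pairwise check is only meaningful when no instruction in between writes the shared register.

```c
#include <assert.h>

/* Hypothetical model: one bit per register; memory operands are ignored. */
enum { XMM0 = 1 << 0, RAX = 1 << 1, RCX = 1 << 2, RDX = 1 << 3 };

/* Hazard flags returned by classify(); names are illustrative only. */
enum { HAZ_NONE = 0, HAZ_RAW = 1, HAZ_WAR = 2, HAZ_WAW = 4 };

typedef struct {
    unsigned reads;   /* registers the instruction reads  */
    unsigned writes;  /* registers the instruction writes */
} Insn;

/* Hazards from `earlier` to `later` in program order. Only meaningful when
   no instruction in between writes the shared register. */
int classify(Insn earlier, Insn later)
{
    int h = HAZ_NONE;
    if (earlier.writes & later.reads)  h |= HAZ_RAW;  /* read after write  */
    if (earlier.reads  & later.writes) h |= HAZ_WAR;  /* write after read  */
    if (earlier.writes & later.writes) h |= HAZ_WAW;  /* write after write */
    return h;  /* two pure reads set no flag: there is no RAR hazard */
}

/* Register read/write sets for lines 2, 4 and 5 of the listing. */
const Insn line2 = { RDX | RAX,        XMM0 };  /* movsd -8(%rdx,%rax,8), %xmm0 */
const Insn line4 = { RCX | RAX | XMM0, XMM0 };  /* addsd (%rcx,%rax,8), %xmm0   */
const Insn line5 = { XMM0 | RDX | RAX, 0    };  /* movsd %xmm0, (%rdx,%rax,8)   */

/* Usage example: the two pairs from the question. */
void check_listing(void)
{
    assert(classify(line2, line4) & HAZ_RAW);    /* 2 -> 4 is RAW on xmm0 */
    assert(classify(line4, line5) == HAZ_RAW);   /* 4 -> 5 is RAW only    */
}
```

Calling `check_listing()` from a `main` exercises both pairs. In this model the 2 to 4 pair also sets the WAW flag, because both instructions write xmm0, and two instructions that only read xmm0 produce `HAZ_NONE`.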

Nate Eldredge
  • yep I was mistaken in thinking that the dependency would include the two reads. thanks for clearing that up! :) – Megan Darcy Jun 27 '21 at 16:55
  • Read-after-read isn't any kind of hazard or dependency at all. To be hazardous, there needs to be at least one write. (Just like C non-atomic variables, concurrent unsynchronized reads are safe.) – Peter Cordes Jun 27 '21 at 16:59
2

addsd reads and writes its destination operand, so the read side of that is a RAW relative to the earlier load.

The write side would be a WAW, but the write can't happen until after the read (addsd can't produce a value until both its inputs are ready), so you wouldn't normally call that a separate hazard. Handling the RAW dependency in any standard way already takes care of the WAW hazard.

For example, register renaming (Tomasulo's algorithm) keeps track of the fact that addsd reads the load result and that later reads of XMM0 read the addsd result, until the next write creates another version of the register. (This is a lot like SSA, Static Single Assignment.) And in any kind of pipeline, there won't be a value to write back until after the load value has been safely read.
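As a rough illustration of the renaming idea, here is a toy sketch in C. It assumes a hypothetical rename table (`RenameTable`, `rename_init`, `rename_read`, `rename_write` are invented names) that hands out a fresh physical register number on every write and never recycles them. Real hardware uses a finite physical register file with a free list, so this is not how any particular CPU implements it.

```c
#include <assert.h>

enum { NUM_ARCH_REGS = 16 };  /* xmm0..xmm15 in this toy model */

typedef struct {
    int map[NUM_ARCH_REGS];  /* architectural register -> current physical register */
    int next_phys;           /* next unused physical register (never recycled here) */
} RenameTable;

/* Start with each architectural register mapped to its own physical register. */
void rename_init(RenameTable *rt)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        rt->map[i] = i;
    rt->next_phys = NUM_ARCH_REGS;
}

/* A source operand reads whichever version is current at rename time. */
int rename_read(const RenameTable *rt, int arch)
{
    assert(arch >= 0 && arch < NUM_ARCH_REGS);
    return rt->map[arch];
}

/* A destination operand creates a new version, as an SSA assignment would. */
int rename_write(RenameTable *rt, int arch)
{
    assert(arch >= 0 && arch < NUM_ARCH_REGS);
    rt->map[arch] = rt->next_phys++;
    return rt->map[arch];
}
```

For an instruction like addsd that reads and writes XMM0, the sources would be renamed with `rename_read` before the destination is renamed with `rename_write`. The addsd then reads the load's version while later readers, such as the store, see the new version, so WAR and WAW hazards on the architectural name disappear and only the RAW chain remains.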

You could imagine a case like mulsd %xmm0, %xmm5 between the load and the addsd. The mulsd must read the load result, not the addsd result, even if XMM5 came from a cache-miss load and isn't available until long after XMM0, in which case the addsd could be ready to execute before that earlier read of XMM0 has happened (a WAR hazard). An in-order pipeline would have stalled waiting for XMM5 before execution could reach the addsd that reads and writes XMM0, and a register-renaming pipeline (like all modern x86) handles it via renaming.


dependency between line 4 and 5? So is it Read after Read?

That's not hazardous. It's always safe to have multiple concurrent readers of the same register or value. There's no such thing as a RAR hazard; at least one write must be involved for things to get tricky.

Yes, storing the addsd result is a RAW hazard (true dependency). The store-data uop needs to wait for the result from that FP add.

(Fun fact: on Intel CPUs, the store-address uop runs on a separate execution port, independently of the store data, writing the address into the store-buffer entry that was allocated during issue/alloc/rename when the uops of this instruction entered the out-of-order back-end. The store-address uop only reads %rdx and %rax, so it has a RAW dependency on the addq $1, %rax in the previous iteration, assuming this is a loop.)

So line 5 has a RAW dependency on line 4, and is not directly dependent on line 2 in any way. (The load was earlier in the dependency chain leading to the store, but separated from it by going through the addsd.)


Since the jne at the bottom branches back to .L6 at the top, this is a loop, so there's a WAR anti-dependency between the store at the bottom, which reads XMM0, and the load at the top of the next iteration, which writes XMM0.

Peter Cordes
  • thanks for the response haha, saving me the agony of confusion :) As a follow-up to my question: given these dependencies, would it be safe to conclude that the iteration-dependent registers in this case are %xmm0, %rdx and %rax? Or am I mistaken and it should be %xmm0 and %rax only? I'm confused about the movsd %xmm0, (%rdx,%rax,8) line using rdx as a destination register. – Megan Darcy Jun 27 '21 at 16:51
  • @MeganDarcy: RDX is never written inside the loop. (Or is this not the whole loop? Is there a modification to RDX somewhere outside the part we can see? Where's `.L6`?) Memory *pointed-to* by RDX and RAX is written to, but a store instruction (and the store-address uop) only reads registers, writes [the store-buffer entry](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram). – Peter Cordes Jun 27 '21 at 16:57
  • apologies I made a typo, .L6 is actually the first part of the asm we see. I have edited accordingly :) – Megan Darcy Jun 27 '21 at 17:12
  • so given the edit, would that mean that %rdx is written in the loop? Because the C code (which I will also edit in) shows the variable getting assigned to repeatedly. – Megan Darcy Jun 27 '21 at 17:13
  • @MeganDarcy: No, like I just said, none of the instructions in the question modify the RDX register, only read it (to use it as a memory address). If there had been an `add $4096, %rdx` earlier in the loop (in a part you'd left out), then there'd be RAW (and WAR across iterations) hazards between that and the addressing modes that read RDX. – Peter Cordes Jun 27 '21 at 17:14
  • so this line movsd %xmm0, (%rdx,%rax,8) like you said earlier, is actually moving the value of %xmm0 into the memory pointed to by (%rdx,%rax,8) ? – Megan Darcy Jun 27 '21 at 17:17
  • @MeganDarcy: Yes, of course. In AT&T syntax, `(...)` is a memory operand. Just like in your C code, `double *p` is RDX, and `p` is never modified, only the memory pointed to through `p`. – Peter Cordes Jun 27 '21 at 17:18
  • okay perfect, thank you so much for being so patient with me hahaha ;) – Megan Darcy Jun 27 '21 at 17:20
  • @MeganDarcy: And BTW, this asm is a really inefficient implementation of it, I guess because there's no `restrict` qualifier and the compiler didn't check for overlap between `p[]` and `a[]` and make a version that keeps the running total in a register. Putting a store/reload in the loop-carried dependency chain, instead of `p[i] = (sum += a[i])`, makes this prefix sum take about twice as long as it should, even without using SIMD. ([SIMD prefix sum on Intel cpu](https://stackoverflow.com/q/10587598)) – Peter Cordes Jun 27 '21 at 17:22
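
The running-total version suggested in the last comment can be sketched as follows. The function name `prefix_sum_running`, the `n <= 0` guard, and the null-pointer assertion are assumptions added here for illustration and are not in the original code. The sketch keeps the sum in a local variable and leaves out `restrict`, so it is a guess at the intended shape rather than the exact code the comment has in mind.

```c
#include <assert.h>
#include <stddef.h>

/* Prefix sum that carries the running total in a local variable, so the
   loop-carried dependency is a register add rather than a store/reload.
   The n <= 0 guard is an addition; the original writes p[0] unconditionally. */
void prefix_sum_running(const double *a, double *p, long n)
{
    if (n <= 0)
        return;
    assert(a != NULL && p != NULL);

    double sum = a[0];
    p[0] = sum;
    for (long i = 1; i < n; i++) {
        sum += a[i];   /* the only loop-carried data dependency is on sum */
        p[i] = sum;    /* the store is no longer on the critical path */
    }
}
```

For the input `{1.0, 2.0, 3.0, 4.0}` this fills `p` with `{1.0, 3.0, 6.0, 10.0}`, the same values the original loop computes. Whether a compiler actually emits a tighter loop for this form depends on the compiler and its options.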