
In ARM SVE there are masked load instructions, svld1, and there are also first-faulting loads, svldff1, which are often used with an all-true predicate: svldff1(svptrue<>).

Questions:

  • Does it make sense to do svld1 with a mask, as opposed to svldff1?
  • The behaviour of the mask in svldff1 seems confusing. Is there a practical reason to pass a mask other than svptrue to svldff1?
  • Is there any performance difference between svld1 and svldff1?
Denis Yaroshevskiy
    If you're not expecting to be near the end of a buffer, a normal load will fault instead of returning wrong data if there's a bug in your program. (e.g. you might be doing masked loads with a mask from some other compare result, not related to handling the start/end of an array). I don't know about performance. – Peter Cordes Dec 17 '22 at 21:42
  • That's true. If bug prevention is the only reason - that's OK. – Denis Yaroshevskiy Dec 17 '22 at 21:46

1 Answer


Both ldff1 and ld1 can be used to load a vector register. According to my informal tests on an AWS Graviton processor, I find no performance difference, in the sense that both instructions (ldff1 and ld1) seem to have roughly the same performance characteristics. However, ldff1 reads and writes the first-fault register (FFR). This implies that you cannot do more than one ldff1 at a time within an 'FFR group', since they are order-sensitive and depend crucially on the FFR.

Furthermore, the ldff1 instruction is meant to be used along with the rdffr instruction, which generates a mask indicating which loads were successful. Using the rdffr instruction will obviously add some cost: I assume it may need to run after ldff1w, increasing the latency by at least a cycle. And then, of course, you have to do something with the mask that rdffr produces...

Obviously, there is bound to be some small overhead tied to the FFR (clearing, setting, accessing).

"Is there a practical reason to pass a mask other than svptrue to svldff1?": The documentation states that the leading inactive elements (up to the fault) are set to zero, i.e. zeroing predication applies.
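The interaction between the predicate, the loaded data, and the FFR can be sketched with a small scalar model. This is illustrative C, not real SVE code: the lane count VL, the model_ldff1 name, and the faults_at parameter are inventions for the example (in real hardware the MMU decides where a fault occurs, and the actual intrinsics are svldff1, svrdffr, and svsetffr).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#define VL 8  /* lane count for the sketch; real SVE vector length varies */

/* Scalar model of a first-faulting load. 'faults_at' is the index of the
   first lane whose address would fault (VL if none). Active lanes before
   the fault are loaded, inactive lanes are zeroed (zeroing predication),
   and the FFR stays true for every lane before the first faulting active
   lane. */
static void model_ldff1(const int32_t *src, const bool pred[VL],
                        size_t faults_at, int32_t out[VL], bool ffr[VL])
{
    bool ok = true;
    for (size_t i = 0; i < VL; i++) {
        if (pred[i] && i >= faults_at)
            ok = false;                 /* first faulting active lane stops the load */
        ffr[i] = ok;
        out[i] = (pred[i] && ok) ? src[i] : 0;
    }
}
```

In a real loop you would clear the FFR with svsetffr before the load, do the svldff1, then call svrdffr to obtain the mask of lanes that were actually loaded and process only those, retrying from the first failed lane.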

Daniel Lemire
  • Do I read you correctly that you likely cannot do more than one ldff1 in parallel, and they'd have to be sequenced? – Denis Yaroshevskiy Dec 19 '22 at 17:16
  • Within an FFR group, the loads should be sequential: the value of the FFR before a load does affect which elements of the load result are defined, and the loads do still write to the FFR. – Daniel Lemire Dec 19 '22 at 17:20
  • Practically, I expect that, yes, you cannot do more than one ldff1 in parallel. – Daniel Lemire Dec 19 '22 at 17:27
  • @DenisYaroshevskiy: I'd guess that the FFR is renamed so multiple `ldff1` loads could be in flight. But if not, then there'd be some serialization, at least in doing the TLB checks for them. (Faulting or not is determined by the page-tables, and doesn't have to wait for the data from a cache miss.) – Peter Cordes Dec 19 '22 at 18:02
  • @PeterCordes Within an FFR group, I don't think you have much freedom because there is a data dependency. Of course, speculation is possible... but it seems unlikely. – Daniel Lemire Dec 26 '22 at 18:33
  • @DanielLemire: Right, but you could have one `ldff1` and some instructions that read that mask, and another `ldff1` from a different address and some other instructions that read that mask. With register renaming, those two dep chains can overlap instead of being serialized by WAW and WAR hazards. (Especially hit under miss or miss under miss, or just both hit and separate ALU dep chains all needing to read two different masks.) – Peter Cordes Dec 26 '22 at 19:59
  • @PeterCordes Right. I used the qualifier 'practically' because I expect that it is not something you can actually do, even if it is theoretically possible. That's only a guess on my part, of course. But it is an empirical question: one just has to run an experiment and show that you can have multiple ldff1 in flight at once. – Daniel Lemire Jan 05 '23 at 14:13
  • @DanielLemire: Ok yeah, not renaming FFR would be a possible design, but it seems to me like there are lots of use-cases where renaming could avoid big stalls for WAW / WAR hazards. So for a high-end core that's already going to spend the transistors to support SVE at all and do out-of-order exec with register-renaming of vector regs, also renaming FFR is a design choice I'd expect. On an in-order core then probably not, and you'd want to be more careful about scheduling to process a line likely to cache-miss last. – Peter Cordes Jan 05 '23 at 14:33
  • @DanielLemire: As you say, it should be fairly straightforward to microbenchmark, and that's the only way to know for sure on any given core. I could certainly be wrong in my guesses about what would make sense for ARM design choices. But with only one FFR, you don't need *many* registers to rename onto, and it could have its own RAT and small register file, not increasing the size or width of the existing RAT for vector regs. (Cost scales more than linearly with number of registers to be renamed per cycle, I think, if they're all of the same type renaming onto the same pool.) – Peter Cordes Jan 05 '23 at 14:34
  • I agree. To be investigated. – Daniel Lemire Jan 06 '23 at 22:55