
Suppose 32-bit Linux: the alignment rules say, for example, that doubles (8 bytes) only need to be aligned to 4 bytes. This means that, if we assume 64-byte cache blocks (a typical value for modern processors), we can have a double starting at offset 60 of a line, which means the double will span 2 different cache blocks. It could even happen that the two halves of the double end up in 2 different cache blocks located in 2 different 4 KiB pages.
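For concreteness, here is a minimal NASM sketch of the layout I mean (the label name and the constant are just illustrative):

```nasm
section .data
    align 64                  ; start of a cache line, assuming 64-byte lines
    times 60 db 0             ; 60 bytes of other data
d:  dq 3.141592653589793      ; 8-byte double at offsets 60..67 of the line:
                              ; its first 4 bytes end this cache line and its
                              ; last 4 bytes start the next one
```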

After this brief introduction to put the question in context, I have a couple of questions:

1- When programming in assembly for maximum performance, it is recommended to prevent these splits from happening by using alignment directives, right? Or is there some reason I'm not aware of why aligning the double so that it sits in only 1 block makes no difference to performance?

2- How will the store instruction be decoded in the case described above? (Suppose a modern Intel microarchitecture.) I know that a normal x86 store instruction is decoded into a micro-fused pair of store-address and store-data uops, but in this case, where 2 different cache blocks (and maybe even 2 different 4 KiB pages) are involved, will it be decoded into 2 micro-fused pairs of store-address and store-data (one for the first 4 bytes of the double and another for the last 4 bytes)? Or will it be decoded into a single micro-fused pair, with the store-address and store-data uops having to do twice the work before they can finally leave the execution port?

asked by isma, edited by Peter Cordes

1 Answer


Yes, of course you should align a double whenever possible, like compilers do except when forced by ABI struct-layout rules to misalign them. (The ABI was designed when i386 was current so a double always required 2 loads anyway.)

The current version of the i386 System V ABI requires 16-byte stack alignment, so local doubles (those that have to get spilled at all instead of kept in regs) can be aligned. malloc has to return memory suitable for any type, and alignof(max_align_t) = 16 on 32-bit Linux (8 on 32-bit Windows), so 32-bit malloc will always give you at least 16 (or 8)-byte aligned memory. And of course in static storage you control the alignment yourself with `align` (NASM) or `.p2align` (GAS) directives.
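For example (a sketch of the directives just mentioned; the label names are arbitrary):

```nasm
; NASM
section .data
    align 8                   ; next item starts on an 8-byte boundary,
x:  dq 1.5                    ; so this double can never straddle a cache line or a page

; GAS equivalent:
;     .data
;     .p2align 3              # 2^3 = 8-byte alignment
; x:  .double 1.5
```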


For the perf downsides of cacheline splits and page splits, see [How can I accurately benchmark unaligned access speed on x86_64](https://stackoverflow.com/q/45128763).
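If you want a rough idea of the kind of loop that question measures, here is a sketch only (the buffer, the 60-byte offset, and the iteration count are arbitrary choices; it assumes SSE2 and 32-bit Linux to match the question): repeatedly load an 8-byte value that straddles a 64-byte boundary, and compare against an aligned offset such as `[buf + 0]`.

```nasm
; build: nasm -felf32 split.asm && ld -m elf_i386 -o split split.o
global _start

section .bss
    alignb 64
buf:    resb 128                  ; two full cache lines

section .text
_start:
    mov   ecx, 100000000          ; iteration count (arbitrary)
.loop:
    movsd xmm0, [buf + 60]        ; 8-byte load spanning the line boundary at buf+64
    dec   ecx
    jnz   .loop

    mov   eax, 1                  ; exit(0) via the 32-bit int 0x80 ABI
    xor   ebx, ebx
    int   0x80
```

Timing this (e.g. under perf stat) against the aligned version shows the throughput cost of line-split loads.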


re: decoding: The address isn't known at decode time, so obviously any effects of a line split or page split are resolved later. For stores, probably no effect until the store-buffer entry has to commit to L1d cache. [Are two store buffer entries needed for split line/page stores on recent Intel?](https://stackoverflow.com/questions/61180829/are-two-store-buffer-entries-needed-for-split-line-page-stores-on-recent-intel) - probably not; allocating a 2nd entry after executing the store-address uop is implausible.

For loads, the load uop gets re-run through the execution unit to get the other half (or whatever uneven split), using internal line-split buffers to combine the data. (It's not re-dispatched from the RS, just handled internally in the load port. But the RS does aggressively replay uops waiting for the result of a load.)

Re-running the store-data uop for a misaligned store seems unlikely, too. I don't think we see extra counts for `uops_dispatched_port.port_4` perf events.
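One way to sanity-check that on real hardware (a sketch under assumptions: an Intel CPU where port 4 is the store-data port, the `uops_dispatched_port.port_4` event mentioned above, and arbitrary buffer/iteration choices) is to run a loop of line-split stores under `perf stat -e uops_dispatched_port.port_4` and see whether the count is roughly one per store rather than two:

```nasm
; build: nasm -felf32 splitstore.asm && ld -m elf_i386 -o splitstore splitstore.o
; run:   perf stat -e uops_dispatched_port.port_4 ./splitstore
global _start

section .bss
    alignb 64
buf:    resb 128

section .text
_start:
    pxor  xmm0, xmm0
    mov   ecx, 100000000          ; number of stores (arbitrary)
.loop:
    movsd [buf + 60], xmm0        ; 8-byte store straddling the 64-byte boundary
    dec   ecx
    jnz   .loop

    mov   eax, 1                  ; exit(0)
    xor   ebx, ebx
    int   0x80
```

If the count stays close to the 100M stores rather than 200M, that would suggest the store-data uop is not being re-run for the split.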

answered by Peter Cordes
  • And do you know the answer to the second question? It is the one that makes me most curious: knowing what happens if I decide not to align the data and it ends up split between 2 blocks (or even between 2 pages). (Maybe, at the moment I'm asking this, you are already editing your answer to include this; if that's the case I will delete this comment.) – isma Jul 23 '20 at 11:29
  • @isma: Oh, when I skimmed the question, I didn't realize you meant literal *decode*. [How can I accurately benchmark unaligned access speed on x86\_64](https://stackoverflow.com/q/45128763) I think answers most of the other uop details you asked about. – Peter Cordes Jul 23 '20 at 11:38
  • The question https://stackoverflow.com/questions/61180829/are-two-store-buffer-entries-needed-for-split-line-page-stores-on-recent-intel was the one that literally gave me my doubt. But after reading the links you gave me, there are some things that I still don't get: 1- what is the RS? 2- I think I have understood the load part: 2 loads will be done for the 2 halves of the data and everything will be handled in the load port in "1 execution", I mean, without needing to leave the load port and re-enter later to resolve the other half. Right? 3- (continued in the following comment) – isma Jul 23 '20 at 11:51
  • @isma: RS = reservation station = scheduler. – Peter Cordes Jul 23 '20 at 11:52
  • 3- I haven't understood how the store is handled. I mean, the instruction is decoded into only 1 micro-fused pair of store-address and store-data. And I have understood that this will use only 1 store-buffer entry, like a normal store (e.g. the 8 bytes of the double will be stored there like a normal store). But what will happen later when it has to "commit" to the L1d? – isma Jul 23 '20 at 11:54
  • @isma: Presumably that one SB entry will take 2 cycles to commit to 2 separate lines of L1d. We don't know the details. Of course for page-splits, the store-address uop will have already checked the TLB for both virtual pages (which might not map to contiguous physical pages), so some mechanism needs to handle this possibility. We're not sure where exactly it puts the necessary info or what it does with it, otherwise someone would have answered Bee's question :P – Peter Cordes Jul 23 '20 at 12:05
  • Maybe you were bullying BeeOnRope and you didn't want to answer his question! Hahaha, just kidding XD /// So both TLB checks are done at commit to L1d from the store buffer, right? /// And I assume what I said in a previous comment about the load part (which starts with: "2- I think that I have understood the load part: 2 loads will ...") is correct, since you didn't tell me otherwise. Forgive me that sometimes I want to make sure I have understood things well; I do it just in case I have misunderstood due to the language. And as always, Peter, thanks for your patience :-D – isma Jul 23 '20 at 12:16
  • @isma: No, TLB checking has to be done during execution. Waiting until after retirement (at commit) would leave no way to handle a faulting store! You also need to use the value of the TLB / page tables that was in effect when the store executed, not some possible later value for the other page (if the store-address uop left one untranslated). It would also require a connection from the commit stage to the L1dTLB, instead of just from the load / store-address ports. – Peter Cordes Jul 23 '20 at 12:20
  • @isma: I find it a pretty annoying waste of my time when people ask me to re-state something that I already said pretty clearly, especially if it's also something you can read more about in other places (like split loads). Didn't I link [Are load ops deallocated from the RS when they dispatch, complete or some other time?](https://stackoverflow.com/q/59905395) and the moved-to-chat comment thread just the other day, about replays of uops dependent on split loads, but not the load itself? Yes, the load is not re-dispatched from the RS, just takes another cycle sometime later in the load port. – Peter Cordes Jul 23 '20 at 12:25
  • And I also just said that in my answer. If you did an experiment with perf counters and found something that didn't match your understanding of what I said, I'd be happy to take a look. But seriously, please don't make me tell you everything twice, I don't enjoy that kind of conversation at all. – Peter Cordes Jul 23 '20 at 12:27
  • But you previously said "the store-address uop will have already checked the TLB for both virtual pages", but since it's only 1 store-address uop which has to deal with that 8-byte double split between 2 pages, will that store-address uop know at run time that it has to check 2 virtual pages in the TLB? And a thousand apologies for re-asking to be sure of what I asked before! I didn't know it bothered you so much; I will try not to do it anymore, the last thing I intended was to bother you. – isma Jul 23 '20 at 12:28
  • About doing experiments myself: if I don't test some things, it's because I don't have an Intel processor, and I think these things can't be tested if you don't have the CPU you want to test on. Or, at least, I don't know of any mechanism to do that :( – isma Jul 23 '20 at 12:31
  • @isma: (just for the record, *that's* the kind of follow-up question I don't mind at all. That's a good question with a non-obvious answer, one that I didn't already just explain here or in a link). That's part of why page-splits are slow even for stores :P The store-address uop knows the size, and it knows the final address it generated. It can easily check the low 12 bits (the page-offset part) for being `<= (4096-8)`, or maybe just check whether the 12-bit `addr+len` produces a carry-out. (A small sketch of this check follows the comment thread below.) – Peter Cordes Jul 23 '20 at 12:34
  • This special case could trigger an internal exception or something (pre-Skylake page-splits cost over 100 cycles, perhaps like a microcode assist for subnormal FP that basically flushes the pipeline). Or on Skylake apparently they handle it with only a minor penalty, maybe similar to how a cache-split load uop gets itself re-executed later on the same port to handle the other side. – Peter Cordes Jul 23 '20 at 12:36
  • Re: testing: that's a problem. If you're really interested in *Intel* microarchitectures instead of whatever you have (AMD?), maybe pick up a 2nd-hand Haswell or Skylake laptop or desktop. Having some sanity checks on your understanding of the basics is important. I think I've seen something that lets you run code on cloud servers with perf counters (like quickbench.com but with profiling). I think it was http://www.peachpy.io/ which is down at the moment. But I think you could only submit asm in some kind of Python syntax (to be JITed and run), not plain NASM source. – Peter Cordes Jul 23 '20 at 12:41
  • Oh, I didn't know that the store-address unit checks whether the data straddles 2 pages and therefore needs 2 TLB look-ups. This is very interesting to know! And I'm glad that you found my last question about the TLB interesting to answer. The truth is that I had felt bad because some of my questions seemed redundant, since you are more helpful than any of my college professors. (And yes, AMD user :-( ). Cheers, Peter! – isma Jul 23 '20 at 12:47
  • AMD's internals have many similarities to Intel's. They have a patent-sharing agreement, so when there's only one good (or one clearly best) way to solve a problem, the internals are often very similar. I assume it's fairly well supported by Linux `perf`, so you could certainly try stuff on your own machine. Zen is a very relevant microarchitecture at the moment, if that's what you have, so it would be interesting to see questions about its internals. – Peter Cordes Jul 23 '20 at 12:52
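Regarding the low-12-bit check described a few comments above, here is a minimal sketch (the example address and the 8-byte access size are made up for illustration); it exits with status 1 if an 8-byte access at that address would cross a 4 KiB page boundary:

```nasm
; build: nasm -felf32 pagesplit.asm && ld -m elf_i386 -o pagesplit pagesplit.o
global _start

section .text
_start:
    mov   eax, 0x00402FFC     ; example address: page offset = 0xFFC = 4092
    and   eax, 0xFFF          ; keep the low 12 bits (offset within the 4 KiB page)
    xor   ebx, ebx            ; exit status 0 = no page split
    cmp   eax, 4096 - 8
    jbe   .done               ; offset <= 4088: all 8 bytes fit in this page
    mov   ebx, 1              ; offset > 4088: the access crosses into the next page
.done:
    mov   eax, 1              ; SYS_exit
    int   0x80
```

Run it and check the result with `echo $?`.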