What <4GB workloads would have worse performance in the Linux x32 ABI than x64?

Question

There is a relatively new Linux ABI referred to as x32, where the x86-64 processor runs in 32-bit mode, so pointers are still only 32 bits wide, but the 64-bit architecture-specific registers are still used. So you're still limited to 4GB of memory as in normal 32-bit, but your pointers take up less cache space than they do in 64-bit, you can do 64-bit arithmetic efficiently, and you get access to more registers (16) than you would in vanilla 32-bit (8).

Assuming you have a workload that fits nicely within 4GB, is there any way the performance of x32 could be worse than on x86-64?

It seems to me that if you don't need the extra memory space nothing is lost -- you should always get the same perf (when you already fit in cache) or better (when the pointer space savings lets you fit more in cache). But it wouldn't surprise me if there are paging/TLB/etc. details that I don't know about.
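To put a number on the cache-space argument, here is a minimal sketch in C. The node layout and the 64-byte cache line are assumptions chosen for illustration, not something taken from a real workload; `node_size` and `nodes_per_line` are hypothetical helper names. The pointer size is a parameter, so the arithmetic can be checked on any host.

```c
#include <stddef.h>

/* Round n up to the next multiple of align (align must be non-zero). */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) / align * align;
}

/*
 * Size of a hypothetical list node
 *     struct node { struct node *prev, *next; int key; };
 * for a given pointer size. Assumes a 4-byte int and that the struct's
 * alignment equals the pointer size (4 on x32, 8 on x86-64).
 */
size_t node_size(size_t ptr_size) {
    size_t raw = 2 * ptr_size + 4;   /* two pointers plus the int */
    return round_up(raw, ptr_size);  /* tail padding up to the alignment */
}

/* How many whole nodes fit in one cache line of line_size bytes. */
size_t nodes_per_line(size_t ptr_size, size_t line_size) {
    return line_size / node_size(ptr_size);
}
```

Under these assumptions the node is 12 bytes with 4-byte pointers and 24 bytes with 8-byte pointers, so a 64-byte line holds five nodes instead of two.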

The devil is in the details, so I wouldn't be very surprised if, on some rare occasions, x32 turned out a little worse than x86-64 under your conditions. But I don't believe that is common. (You could imagine that alignment constraints are weaker on x32, and that might occasionally hurt cache performance.) – Basile Starynkevitch Oct 15 '12 at 20:07
Keep in mind that pointer size is not the only difference between the two ABIs - x86-64 also has more registers, which can reduce the number of load/store instructions, and quite a few other differences. As a result, there's not really a simple answer to this question, and benchmarking/testing would almost always be the best route to determine which is "better" by whatever definition of "better" is important to that particular project. – twalberg Oct 15 '12 at 20:49
@twalberg: I think you may have misread the question -- x32 and x86-64 have the same number of registers. I'm not talking about normal 32-bit. – Joseph Garvin Oct 15 '12 at 21:04
Possible duplicate of [Are 64 bit programs bigger and faster than 32 bit versions?](http://stackoverflow.com/questions/2378399/are-64-bit-programs-bigger-and-faster-than-32-bit-versions) – Ben Voigt Oct 15 '12 at 21:31
@JosephGarvin Ah... never mind... I was thinking of x86-64 running in the legacy 32-bit mode, not running in long mode with self-imposed restrictions... – twalberg Oct 15 '12 at 21:41
@BenVoigt: That's not the same question, x32 != 32-bit. Someone actually asks about this in the first comment on the first answer there, so I don't think this is covered. – Joseph Garvin Oct 15 '12 at 23:50

score 9 · Answer 1 · answered Oct 15 '12 at 20:23

Certainly if you have a multithreaded program, the fact that data structures are smaller on x32 might cause cache line fighting between threads -- different objects might get allocated on the same cache line in x32 mode and different cache lines in x86_64 mode. If two threads modify those objects independently the cache ping-ponging could severely slow down the x32 code. Of course, this kind of cache effect could happen regardless of pointer size, but if the code has been tuned assuming 64-bit pointers, going to 32-bit pointers could de-tune things.

answered Oct 15 '12 at 20:23

Chris Dodd

+1 for being possible, but in practice you should be aligning any data that might be touched by two threads to a cache line. – Joseph Garvin Oct 15 '12 at 21:06
@JosephGarvin: Yes, but the alignment may have been done assuming a particular pointer size. If someone pads stuff so it fills a cache line with 64-bit pointers, changing to 32-bit pointers without updating the padding may be a problem. This is mostly just an issue if you're taking existing, tuned source code and recompiling in x32 mode with no changes. – Chris Dodd Oct 15 '12 at 22:21
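The padding pitfall discussed in this thread can be sketched in C as follows. The struct names, the `elements_share_line` helper, and the 64-byte line size are illustrative assumptions, not code from any real project. The first struct hard-codes a pad length that only fills a line when pointers and `long` are 8 bytes each; the second asks the compiler for the alignment, which does not depend on pointer size.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; common on x86-64 CPUs */

/*
 * Fragile: the pad length assumes sizeof(void *) + sizeof(long) == 16.
 * With 4-byte pointers and 4-byte long (as on x32) the struct shrinks
 * to 56 bytes, so neighbouring array elements start sharing cache lines.
 */
struct counter_fragile {
    void *owner;
    long  count;
    char  pad[CACHE_LINE - 16];
};

/*
 * More robust: align the first member to the line size. The compiler then
 * pads the struct to a multiple of CACHE_LINE whatever the pointer size is.
 */
struct counter_aligned {
    alignas(CACHE_LINE) void *owner;
    long count;
};

/*
 * Returns 1 if elements of the given size, stored in an array whose base
 * is line-aligned, can end up sharing a cache line with a neighbour.
 */
int elements_share_line(size_t elem_size) {
    return elem_size % CACHE_LINE != 0;
}
```

Recompiling the fragile variant for x32 without touching the pad gives a 56-byte element, which is exactly the "de-tuned" case: two threads updating adjacent counters would contend for the same line.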

score 3 · Answer 2 · answered Oct 15 '12 at 21:08

In X32 the processor is actually executing in "long mode", the same mode as for x86_64. That is, addresses as seen by the processor when doing addressing are still 64 bits; however, the X32 ABI makes sure that all addresses are small enough to fit into 32 bits. As a result, in some cases there is a slight overhead when pointers have to be zero-extended from 32 bits to 64.

Also, needing x86/x86-64/x32 libraries in RAM, which I suppose is what one will end up with in practice (unless you're talking about some embedded or other tightly controlled system rather than a general purpose computer), may eat up some of the benefit of X32.

answered Oct 15 '12 at 21:08

janneb

Aren't the pointers actually sign-extended? And I don't believe there's any performance penalty to a 32-bit load or store instruction in long mode, both sign extension and zero extension are extremely cheap operations handled in hardware during the same cycle (no delay added). – Ben Voigt Oct 15 '12 at 21:39
I think embedded and tightly controlled systems is the intended target, so I doubt the library RAM usage issue would crop up. – Joseph Garvin Oct 15 '12 at 23:53
@BenVoigt: They might indeed be sign extended rather than zero, I forget which. And no, there is no penalty for 32-bit load/store in long mode, rather the opposite as rXX register encodings take more space than the 32-bit reg encodings. And yes, sign/zero extensions are very cheap, though they do take up a tiny bit of decoder BW and bloat the code. – janneb Oct 16 '12 at 08:21
@JosephGarvin: So do I, though a lot of the excitement around X32 on the interwebs seems to come from people who are excited by some desktop benchmark potentially going a few percent faster. :-/ – janneb Oct 16 '12 at 08:23
Nowadays recent macOS versions are 64-bit only and many Linux distros don't ship 32-bit support by default, so it's possible to have only x32 libraries, or only x32 and x86-64 libraries, to save library memory usage. – phuclv Oct 28 '22 at 03:50
@phuclv In principle. In practice X32 is pretty much dead. – janneb Oct 28 '22 at 05:48
GCC targeting x32 often uses a `67h` address-size prefix on every instruction with a memory operand, so it doesn't need to sign-extend `int` / `long` / `intptr_t` from 32-bit to 64-bit for use with addressing modes like `[rdi + rsi*4]`. Instead it uses `[edi + esi*4]`, wrapping after binary addition (so it works for signed negative) *before* hardware zero-extends to 64-bit. GCC's still somewhat naive about knowing that it can use `[rdi + 4]` (since the offset is known non-negative) instead of `[edi+4]`, but at least it avoids 67h for RSP stack addresses. – Peter Cordes Oct 28 '22 at 14:00
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267 - back in 2018, `-mx32` would even do `movl (%edi), %eax` ; `movq (%eax), %rax` for two derefs of a `long long **p`. The `(%edi)` is maybe needed if the calling convention allows high garbage, the 2nd is definitely not. Neither `-maddress-mode=long` nor `-maddress-mode=short` are optimal for code-size. – Peter Cordes Oct 28 '22 at 14:03
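The addressing-mode point in the comments above can be modeled with plain integer arithmetic. This is a sketch of the address calculation only, not GCC output; the function names are hypothetical. It assumes registers hold zero-extended 32-bit values, as they do after a 32-bit register write in long mode.

```c
#include <stdint.h>

/*
 * Model of a 32-bit address-size access like [edi + esi*4]:
 * the sum wraps modulo 2^32, then is zero-extended to 64 bits.
 */
uint64_t ea_addr32(uint32_t base, int32_t index, uint32_t scale) {
    uint32_t ea = base + (uint32_t)index * scale;  /* wraps at 32 bits */
    return (uint64_t)ea;                           /* zero-extension */
}

/*
 * Model of [rdi + rsi*4] when the index register holds a 32-bit value
 * that was zero-extended rather than sign-extended: a negative index
 * looks like a huge positive one, so the address is wrong.
 */
uint64_t ea_addr64_zero_ext(uint32_t base, int32_t index, uint32_t scale) {
    return (uint64_t)base + (uint64_t)(uint32_t)index * scale;
}

/*
 * Model of [rdi + rsi*4] after an explicit sign-extension of the index
 * (the extra instruction the 32-bit address size avoids).
 */
uint64_t ea_addr64_sign_ext(uint32_t base, int32_t index, uint32_t scale) {
    return (uint64_t)base + (uint64_t)((int64_t)index * (int64_t)scale);
}
```

With `base = 0x1000`, `index = -1`, `scale = 4`, the 32-bit form and the sign-extended 64-bit form both give `0xFFC`, while the 64-bit form without sign-extension gives `0x400000FFC`, which lies outside the 4GB address range.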
