
I am wondering whether using segment descriptors that do not cover the whole linear address space is slower than using ones that do.

I am hoping that there is no difference in speed.

cbot

1 Answer


Technically no: having a smaller limit isn't slower AFAIK.
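(For concreteness, here's a minimal C sketch, my own addition rather than anything from Agner or Intel, of where the limit and the base fields live in an 8-byte data-segment descriptor; the bit layout follows the Intel SDM, and `make_descriptor` is just a name I picked.)

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// access = P/DPL/S/type byte (0x92 = present, ring 0, read/write data).
// flags  = high nibble of byte 6: G (4KiB granularity), D/B, L, AVL.
static uint64_t make_descriptor(uint32_t base, uint32_t limit,
                                uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);               // limit bits 15:0
    d |= (uint64_t)(base & 0xFFFFFFu)     << 16;    // base  bits 23:0
    d |= (uint64_t)access                 << 40;    // access byte
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;    // limit bits 19:16
    d |= (uint64_t)(flags & 0xFu)         << 52;    // flags nibble
    d |= (uint64_t)(base >> 24)           << 56;    // base  bits 31:24
    return d;
}

int main(void)
{
    // Flat 4GiB data segment: base 0, limit 0xFFFFF in 4KiB units (G=1).
    printf("flat 4GiB   : %016" PRIx64 "\n", make_descriptor(0, 0xFFFFF, 0x92, 0xC));
    // Smaller segment: non-zero base at 1MiB, 16MiB limit (0xFFF in 4KiB units).
    printf("base+limited: %016" PRIx64 "\n", make_descriptor(0x100000, 0xFFF, 0x92, 0xC));
}
```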

But having a non-zero base increases load-use latency by 1 cycle on modern mainstream CPUs like Skylake. (i.e. an extra cycle for address generation.) CPU hardware has an optimistic fast path to skip that addition in the normal case of it being zero, otherwise it's part of the critical path for load latency.

That will be most noticeable in pointer-chasing scenarios like linked lists, (binary) trees, and other cases where a load address being ready is part of the critical path dependency chain for data depending on that load. Otherwise out-of-order exec can mostly hide it.

Load-use latency for GP-integer regs on modern Intel CPUs is 5 cycles (with a zero segment base) for an L1d hit, so an extra 1 cycle is significant for workloads sensitive to that. (Or 4 cycles in pointer-chasing use cases (when the base reg itself came from a load) on some Intel CPUs: see Is there a penalty when base+offset is in a different page than the base? for details about their optimistic TLB lookup using the base reg instead of waiting for the proper AGU result. But I think they dropped it for Ice Lake, so it's always 5 cycles.)
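If you want to measure this yourself, the following is a hedged sketch of the kind of pointer-chasing test I mean, for x86-64 Linux: it times a dependent-load chain once with the normal zero base and once through GS with a small non-zero base set via `arch_prctl` (in 64-bit mode only the FS/GS bases still apply, which is why GS is used). `rdtsc` counts reference cycles, so compare the two results as a ratio rather than trusting the absolute numbers; the constants and helper names are mine.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>          // __rdtsc

#ifndef ARCH_SET_GS
#define ARCH_SET_GS 0x1001      // from <asm/prctl.h>
#endif

enum { N_ITERS = 100 * 1000 * 1000 };

// A single self-pointing node: every load's result is the address of the next
// load, so the loop runs at one load-use latency per iteration (the loop
// overhead runs in parallel under out-of-order exec).
static void *chain[1];

static uint64_t chase_flat(long iters) {
    void *p = chain;
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        asm volatile("mov (%0), %0" : "+r"(p));      // p = *p, DS base = 0
    uint64_t t1 = __rdtsc();
    asm volatile("" :: "r"(p));                      // keep the chain live
    return t1 - t0;
}

static uint64_t chase_gs(long iters) {
    void *p = chain[0];                              // chain now stores GS-relative offsets
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        asm volatile("mov %%gs:(%0), %0" : "+r"(p)); // p = gs:[p], non-zero GS base
    uint64_t t1 = __rdtsc();
    asm volatile("" :: "r"(p));
    return t1 - t0;
}

int main(void) {
    chain[0] = chain;                                // node points to itself
    uint64_t flat = chase_flat(N_ITERS);

    uintptr_t base = 64;                             // small non-zero segment base
    if (syscall(SYS_arch_prctl, ARCH_SET_GS, base)) { perror("arch_prctl"); return 1; }
    chain[0] = (char *)chain - base;                 // re-link as a GS-relative offset
    uint64_t seg = chase_gs(N_ITERS);
    syscall(SYS_arch_prctl, ARCH_SET_GS, 0UL);       // restore; user-space glibc doesn't use GS

    printf("zero base (DS)    : %.2f ref-cycles per load\n", (double)flat / N_ITERS);
    printf("non-zero base (GS): %.2f ref-cycles per load\n", (double)seg  / N_ITERS);
}
```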

Non-zero segment bases are still used for thread-local storage, so CPUs do still support them without disastrous performance penalties. (e.g. not trapping to a microcode assist like sub-normal FP values in some cases.) But it does cost some performance.
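That's easy to see in compiler output for thread-local variables; a minimal example (the asm in the comment is only what GCC/Clang typically emit for the local-exec TLS model on x86-64 Linux, so treat it as illustrative):

```c
#include <stdio.h>

static __thread int counter;   // one copy per thread, located via the FS base

int bump(void) {
    // Typically compiles to something like (x86-64, local-exec TLS model):
    //   mov  eax, DWORD PTR fs:counter@tpoff
    //   add  eax, 1
    //   mov  DWORD PTR fs:counter@tpoff, eax
    // i.e. ordinary loads/stores that use a non-zero segment base.
    return ++counter;
}

int main(void) { printf("%d\n", bump()); }
```

Compile with `gcc -O2 -S -masm=intel` to see the FS-relative addressing for yourself.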


See Agner Fog's microarch guide (https://agner.org/optimize/), and other links in https://stackoverflow.com/tags/x86/info. Also I think Intel's optimization manual mentions this for Intel CPUs; I haven't checked AMD's.

For example, for AMD K8/K10, Agner writes:

> AMD manuals say that the branch misprediction penalty is 10 clock cycles if the code segment base is zero and 12 clocks if the code segment base is nonzero. In my measurements, I have found a minimum branch misprediction penalty of 12 and 13 clock cycles, respectively.

And re: data load/stores on K8/K10:

> The time it takes to calculate an address and read from that address in the level-1 cache is 3 clock cycles if the segment base is zero and 4 clock cycles if the segment base is nonzero, according to my measurements. Modern operating systems use paging rather than segmentation to organize memory. You can therefore assume that the segment base is zero in 32-bit and 64-bit operating systems (except for the thread information block which is accessed through FS or GS). The segment base is almost always nonzero in 16-bit systems in protected mode as well as real mode.

I don't actually see a mention of the extra latency on Intel CPUs in Agner's guide, since it's irrelevant for most modern uses. So check Intel's optimization guide.


If you're using a segmented memory model, you'll probably end up using segment-override prefixes occasionally. Some CPUs have limits on the number of prefixes on a single instruction that they can decode efficiently: for example, `pmovzxbd xmm0, [es:eax]` uses 3 prefixes (2 mandatory as part of the SSE4.1 encoding, plus ES) and an escape byte, which would be a problem for early Silvermont-family CPUs (though I think not for later low-power cores). At least in 32-bit mode there are no REX prefixes, which helps avoid hitting that limit.
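To make the byte count concrete, here's roughly how that example encodes in 32-bit mode; this is my own breakdown per the Intel SDM opcode map (exactly which of these count as "prefixes" against a given decoder's limit varies by CPU), with the array and printing just as scaffolding:

```c
#include <stdio.h>

// pmovzxbd xmm0, [es:eax], as encoded in 32-bit mode.
static const unsigned char insn[] = {
    0x26,        // ES segment-override prefix
    0x66,        // operand-size prefix, mandatory for this SSE4.1 encoding
    0x0F, 0x38,  // two-byte escape selecting the 0F 38 opcode map
    0x31,        // PMOVZXBD opcode
    0x00,        // ModRM: mod=00, reg=000 (xmm0), r/m=000 ([eax])
};

int main(void) {
    for (size_t i = 0; i < sizeof insn; i++)
        printf("%02X ", insn[i]);
    printf(" <- %zu prefix/escape bytes ahead of the opcode\n",
           sizeof insn - 2);
}
```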

Agner Fog's microarch guide says there's no special penalty for decoding a segment override on most CPUs.

Also, Intel CPUs (Core 2, Nehalem, and Sandybridge-family at least) do seem to rename segment registers, so modifying one with `mov es, eax` doesn't have to serialize out-of-order exec. But it's still not cheap: about 10 fused-domain uops for the front-end on Skylake, with one per 18 cycles throughput (although writes to different segment regs can pipeline with each other). See Is a mov to a segmentation register slower than a mov to a general purpose register? for some more details and test results.
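If you want to check that number yourself from user space, here's a rough sketch (my own, with an arbitrary iteration count): loading ES with the null selector is architecturally allowed in 64-bit user mode and doesn't fault, and `rdtsc` gives reference cycles, so treat the result as approximate.

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>          // __rdtsc

enum { N = 10 * 1000 * 1000 };  // arbitrary iteration count

int main(void) {
    unsigned short null_sel = 0;                         // null selector: legal to load into ES
    uint64_t t0 = __rdtsc();
    for (long i = 0; i < N; i++)
        asm volatile("mov %0, %%es" :: "r"(null_sel));   // back-to-back writes to ES
    uint64_t t1 = __rdtsc();
    printf("~%.1f ref-cycles per mov to ES\n", (double)(t1 - t0) / N);
}
```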

Peter Cordes
  • I wanted to have the segment bases start at 1 so I would not need to check for NULL pointers and could still use page 0. I wanted to use segmentation so that the kernel would not have to explicitly check userspace memory addresses; the general protection fault handler can handle bad addresses from userspace. Do you think this would heavily affect performance? – cbot Mar 01 '22 at 04:38
  • @cbot: How would that even help with NULL pointers? You'd still use pointers that were just 32-bit, i.e. the "offset" part of the seg:off logical address. The linear address would then be `1+offset`, and linear=virtual address `1+0 = 1` doesn't trigger any segment-related exception. Normally what you do is leave the entire low page unmapped, so NULL deref triggers a #PF exception. (Even with stuff like `ptr->member` which would typically compile to an access like `[eax + 8]` or whatever. Or array deref.) Linux uses [`mmap_min_addr`](https://wiki.debian.org/mmap_min_addr) = 4096 or 65536. – Peter Cordes Mar 01 '22 at 04:45
  • @cbot: Or to put it another way, I suspect you were thinking that base:limit is an allowed range for pointers. It's not, the base is added to the "pointer" when you use it as the offset part of a seg:off address, that's why it's on the critical path for address-generation, unlike just checking in parallel with TLB/cache access like the limit check is. That's also why a non-zero FS or GS base is usable for thread-local storage. – Peter Cordes Mar 01 '22 at 04:48
  • My bad, I thought it was an allowed range; I need to read up more on segmentation. I have everything set up as a flat address space. That makes sense for FS and GS. Yeah, I'll just leave the bottom page unmapped. So just to confirm, having a limit that is smaller than the full 4GiB won't cost any performance? – cbot Mar 01 '22 at 04:56
  • @cbot: Right, I haven't ever seen a mention of that being slower on any CPU, and it makes sense that it wouldn't be, since it's not on the critical path for address-generation in a load/store unit. Just like an L1d miss or TLB miss or permission error, it can be checked in parallel, with the load/store marked as faulting if it reaches retirement. (We know from Meltdown that that's how Intel handles TLB permission errors.) There aren't two different non-faulting behaviours based on the limit. – Peter Cordes Mar 01 '22 at 05:24