17

Intel's 32-bit processors such as the Pentium have a 64-bit-wide data bus and therefore fetch 8 bytes per access. Based on this, I'm assuming that the physical addresses these processors emit on the address bus are always multiples of 8.

Firstly, is this conclusion correct?

Secondly, if it is correct, then one should align data structure members on an 8 byte boundary. But I've seen people using a 4-byte alignment instead on these processors.

How can they be justified in doing so?

G S
  • 1
    I have no idea what this question means, but am intrigued about how this relates to programming, and how this might affect me. Where can I read up a basic intro to this low level type stuff? – Rich Bradshaw Jun 28 '09 at 10:38
  • 5
    See "What Every Programmer Should Know About Memory": http://people.redhat.com/drepper/cpumemory.pdf – Crashworks Jun 28 '09 at 10:56
  • 1
    How do you get from "requested reads are always multiples of 8" to "your data should always start on an 8-byte boundary"? I don't see the logical connection between these. As long as the data doesn't *cross* an 8-byte boundary, we're good, aren't we? – jalf Jun 28 '09 at 11:59

5 Answers

18

The usual rule of thumb (straight from Intel's and AMD's optimization manuals) is that every data type should be aligned to its own size. An int32 should be aligned on a 32-bit boundary, an int64 on a 64-bit boundary, and so on. A char will fit just fine anywhere.

Another rule of thumb is, of course, "the compiler has been told about alignment requirements". You don't need to worry about it, because the compiler knows to add the right padding and offsets to allow efficient access to data.
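To see the padding the compiler adds, here is a minimal C11 sketch. The struct `sample` and the helper `padding_bytes` are hypothetical names made up for this illustration, and the exact sizes are implementation-defined.

```c
#include <stddef.h>  /* offsetof, size_t */

/* Hypothetical struct, for illustration only. */
struct sample {
    char  tag;    /* 1 byte; padding usually follows so that count is aligned */
    int   count;
    short code;   /* tail padding usually follows so that arrays stay aligned */
};

/* Each member sits at a multiple of its own alignment. */
_Static_assert(offsetof(struct sample, count) % _Alignof(int) == 0, "count is aligned");
_Static_assert(offsetof(struct sample, code) % _Alignof(short) == 0, "code is aligned");

/* Bytes the compiler added as padding (implementation-defined). */
size_t padding_bytes(void)
{
    return sizeof(struct sample) - (sizeof(char) + sizeof(int) + sizeof(short));
}
```

On a typical x86 compiler this struct takes 12 bytes, 5 of them padding. Declaring the members in the order int, short, char usually brings it down to 8, but the compiler will not do that reordering for you.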

The only exception is when working with SIMD instructions, where you have to manually ensure alignment on most compilers.
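As a rough sketch of what "manually" can look like in C11: the buffer `samples`, the accessor `simd_buffer`, and the helper `is_aligned` are hypothetical names, and the 16-byte figure assumes 16-byte SSE vectors (wider vector extensions would need a larger value).

```c
#include <stdint.h>  /* uintptr_t */

/* Assumption: 16 bytes, the size of one SSE vector. */
#define SIMD_ALIGNMENT 16

/* Hypothetical buffer: _Alignas asks the compiler for 16-byte alignment,
   which is stricter than the natural 4-byte alignment of float. */
static _Alignas(SIMD_ALIGNMENT) float samples[8];

/* Returns 1 if p is a multiple of alignment (alignment must be a power of two). */
static int is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Accessor so other code can reach the aligned buffer. */
const float *simd_buffer(void)
{
    return samples;
}
```

Without the `_Alignas`, the array would only be guaranteed the alignment of a float, which is not enough for aligned vector loads and stores.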

> Secondly, if it is correct, then one should align data structure members on an 8 byte boundary. But I've seen people using a 4-byte alignment instead on these processors.

I don't see how that makes a difference. The CPU can simply issue a read for the 64-bit block that contains those 4 bytes. That means it either gets 4 extra bytes before the requested data, or after it. But in both cases, it only takes a single read. 32-bit alignment of 32-bit-wide data ensures that it won't cross a 64-bit boundary.
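The claim is easy to check with a few lines of C. This is only a sketch: `crosses_block` and `aligned_dwords_never_cross` are hypothetical helpers, and the 8-byte block size is taken from the 64-bit bus described in the question.

```c
#include <stdint.h>

#define BUS_BYTES 8u  /* 64-bit data bus: memory is fetched in 8-byte blocks */

/* Returns 1 if a size-byte access starting at addr touches more than one
   8-byte block, 0 if it fits inside a single block. */
static int crosses_block(uint32_t addr, uint32_t size)
{
    uint32_t first_block = addr / BUS_BYTES;
    uint32_t last_block  = (addr + size - 1) / BUS_BYTES;
    return first_block != last_block;
}

/* Returns 1 if no 4-byte access at a 4-byte-aligned address below end
   crosses a block. start is rounded down to a multiple of 4 first. */
static int aligned_dwords_never_cross(uint32_t start, uint32_t end)
{
    for (uint32_t addr = start & ~3u; addr < end; addr += 4) {
        if (crosses_block(addr, 4)) {
            return 0;
        }
    }
    return 1;
}
```

For example, a 4-byte read at 0x44444 stays inside the block starting at 0x44440, while the same read at the unaligned address 0x44446 spills into the next block and needs a second fetch.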

Cole Tobin
jalf
  • Not if the 4 bytes straddle one 64 bit chunk over to the next. – mP. Jun 28 '09 at 12:58
  • how would that happen if it is aligned on a 4-byte boundary? – jalf Jun 28 '09 at 13:28
  • 5
    I can't believe I missed this simple reasoning. Why waste 4 extra bytes in 8-byte alignment when you achieve the same performance with 4 byte? Thanks Jalf. You make perfect sense. – G S Jun 28 '09 at 16:06
  • @jalf I have posted a different question, also related to alignment (in this case, about words whose size is smaller than the architecture's word size), and I'm not sure whether the reasoning in your answer applies to my question: http://stackoverflow.com/questions/22820576/reading-shorts-in-32-bits-architectures-for-example – ABu Apr 03 '14 at 15:23
  • "The compiler has been told about alignment requirements". But the compiler also has been given a language specification it has to conform to, and in the case of C, that means no member reordering. On many platforms, reordering your structure members can improve performance. Probably not on x86 processors, though it might still decrease memory-footprint. – yyny Feb 16 '19 at 22:32
8

The physical bus is 64 bits wide, so addresses on it are multiples of 8: yes.

HOWEVER, there are two more factors to consider:

  1. Some x86 instructions are byte-addressed. Some are 32-bit aligned (that's where the 4-byte figure comes from). But no (core) instructions are 64-bit aligned. The CPU can handle misaligned data access.
  2. If you care about the performance, you should think about the cache line, not main memory. Cache lines are much wider.
J-16 SDiZ
  • I don't understand. You agree that processors like the Pentium place only multiples of 8 on the address bus. Then you say 4-byte alignment is okay. Well, consider the address 0x000044444. Although it is 4-byte aligned, the processor is never going to emit this address on the address line because it's not a multiple of 8. Hence, fetching memory at this address will require two fetches. How, then, is 4-byte alignment justified? – G S Jun 28 '09 at 11:31
  • 3
    Why would it require two fetches? It would simply request all the data from 0x000044440 to 0x000044447, and since we're interested in 0x000044444-0x000044447, what's the problem? – jalf Jun 28 '09 at 11:54
  • Why are we talking about instruction alignment? That makes no sense. Padding instructions to some boundary with NOPs achieves nothing. – mP. Jun 28 '09 at 12:56
  • x87 instructions can do 64-bit loads / stores (including `fild` / `fistp` of 64-bit integer data), and Pentium had that integrated. So can Pentium's `lock cmpxchg8b` which is very slow if it crosses a cache-line boundary. No instructions *require* 32-bit alignment, though. They just benefit from not splitting across a cache-line boundary. – Peter Cordes Apr 30 '19 at 20:25
2

They are justified in doing so because changing to 8-byte alignment would constitute an ABI change, and the marginal performance improvement is not worth the trouble.

As someone else already said, cache lines matter. All accesses on the actual memory bus are in terms of cache lines (64 bytes on x86, IIRC). See the "What Every Programmer Should Know About Memory" doc that was mentioned already. So the actual memory traffic is 64-byte aligned.

janneb
1

For random access and as long as the data is not misaligned (e.g. crossing a boundary), I don't think that it matters much; the correct address and offset in the data can be found with a simple AND construct in hardware. It gets slow when one read access is not sufficient to get one value. That's also why compilers usually put small values (bytes etc.) together because they don't have to be at a specific offset; shorts should be on even addresses, 32-bit on 4-byte addresses and 64-bit on 8-byte addresses.
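In C terms, the AND looks roughly like this. The helper names `block_base` and `block_offset` are made up for the sketch, and the 8-byte block size assumes the 64-bit bus from the question.

```c
#include <stdint.h>

#define BLOCK_BYTES 8u  /* assumed fetch width: 64 bits */

/* Address of the 8-byte block containing addr: clear the low three bits. */
static uint32_t block_base(uint32_t addr)
{
    return addr & ~(BLOCK_BYTES - 1u);
}

/* Byte position of addr inside that block: keep only the low three bits. */
static uint32_t block_offset(uint32_t addr)
{
    return addr & (BLOCK_BYTES - 1u);
}
```

For the address 0x44444 this gives a block base of 0x44440 and an offset of 4, so an aligned 4-byte value there is simply the upper half of one fetched block.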

Note that if you have caching involved and linear data access, things will be different.

Lucero
1

The 64-bit bus you refer to feeds the caches. The CPU always reads and writes entire cache lines. The size of a cache line is always a multiple of 8, and its physical address is indeed aligned to an 8-byte boundary.

Cache-to-register transfers do not use the external data bus, so the width of that bus is irrelevant.

MSalters