I was reading an article about data type alignment in memory (here) and I am unable to understand one point, i.e.

Note that a double variable will be allocated on 8 byte boundary on 32 bit machine and requires two memory read cycles. On a 64 bit machine, based on number of banks, double variable will be allocated on 8 byte boundary and requires only one memory read cycle.

My question is: why do double variables need to be allocated on an 8-byte boundary rather than a 4-byte one? Even if a double is allocated on a 4-byte boundary, we still need only 2 memory read cycles (on a 32-bit machine). Correct me if I am wrong.

Also, if someone has a good tutorial on member/memory alignment, kindly share.

CharlesB
Ravi Gupta
  • See this answer: http://stackoverflow.com/a/9468315/612429 – Kijewski Jun 06 '12 at 11:22
  • It matches cache alignment, and also SSE instruction requirements. – Oliver Charlesworth Jun 06 '12 at 11:40
  • All this depends on the hardware architecture and not on C. – m0skit0 Jun 06 '12 at 13:47
  • @m0skit0: if everything is arch dependent, then why is it different for different compilers? `A double (eight bytes) will be 8-byte aligned on Windows and 4-byte aligned on Linux (8-byte with -malign-double compile time option).` Source: http://en.wikipedia.org/wiki/Data_structure_alignment – Ravi Gupta Jun 07 '12 at 05:00
  • @OliverCharlesworth: SSE has no 8-byte-alignment-required loads/stores. It's either 16-byte alignment required for 16-byte loads/stores, or no alignment required for any narrower operands. But yes it's good for performance to make doubles 8-byte aligned so they can't split across cache lines. (Or across any other boundaries wider than 8 bytes, for CPUs that care about alignment within a cache line). – Peter Cordes Oct 15 '19 at 05:02
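
The compiler/ABI difference quoted from Wikipedia in the comments can be observed directly. The following is a minimal sketch, assuming a C11 compiler; `struct sample` and the function names are hypothetical, chosen only for illustration. It reports where a `double` lands inside a struct after a single `char`, which is typically 8 on x86-64 and 4 under the 32-bit x86 System V ABI unless `-malign-double` is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct: one char followed by a double. The padding the
   compiler inserts before `d` reveals the in-struct alignment of double. */
struct sample {
    char c;
    double d;
};

/* Offset of the double member; equals the alignment the ABI uses
   for double inside structs on common platforms. */
size_t double_offset_in_struct(void)
{
    size_t off = offsetof(struct sample, d);
    /* Sanity check: the member must fit inside the struct. */
    assert(off + sizeof(double) <= sizeof(struct sample));
    return off;
}

/* Print the layout so it can be compared across compilers and flags. */
void print_double_layout(void)
{
    printf("_Alignof(double)          = %zu\n", (size_t)_Alignof(double));
    printf("offsetof(struct sample,d) = %zu\n", double_offset_in_struct());
    printf("sizeof(struct sample)     = %zu\n", sizeof(struct sample));
}
```

Compiling the same file with and without `-m32` (and with `-malign-double`) and calling `print_double_layout()` should show the difference, assuming a GCC-style toolchain with 32-bit support installed.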

4 Answers

The reason to align a data value of size 2^N on a boundary of 2^N is to avoid the possibility that the value will be split across a cache line boundary.
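
The claim can be checked with simple address arithmetic. This is a sketch under stated assumptions: the 64-byte cache line size and the function names `crosses_boundary` and `count_crossings` are illustrative choices, not taken from any particular CPU manual or library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if an object of `size` bytes starting at `addr` spans more than
   one block of `block` bytes (e.g. a 64-byte cache line). */
bool crosses_boundary(uintptr_t addr, size_t size, size_t block)
{
    assert(size > 0 && block > 0);
    return addr / block != (addr + size - 1) / block;
}

/* Count the start addresses in [0, limit) that are multiples of `align`
   and make a `size`-byte object cross a `block`-byte boundary. */
size_t count_crossings(size_t size, size_t align, size_t block,
                       uintptr_t limit)
{
    size_t n = 0;
    assert(align > 0);
    for (uintptr_t a = 0; a < limit; a += align) {
        if (crosses_boundary(a, size, block)) {
            n++;
        }
    }
    return n;
}
```

With 8-byte alignment, `count_crossings(8, 8, 64, 4096)` finds no crossing at all, while with 4-byte alignment, `count_crossings(8, 4, 64, 4096)` finds one per cache line (the start address at offset 60 within each line).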

The x86-32 processor can fetch a double from any word boundary (8-byte aligned or not) in at most two 32-bit memory reads. But if the value is split across a cache line boundary, then the time to fetch the 2nd word may be quite long because of the need to fetch a 2nd cache line from memory. This produces poor processor performance unnecessarily. (As a practical matter, current processors don't fetch 32 bits from memory at a time; they tend to fetch much bigger values on much wider buses to enable really high data bandwidths; the actual time to fetch both words, if they are in the same cache line and already cached, may be just 1 clock.)

A free consequence of this alignment scheme is that such values also do not cross page boundaries. This avoids the possibility of a page fault in the middle of a data fetch.

So, you should align doubles on 8 byte boundaries for performance reasons. And the compilers know this and just do it for you.
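
To see that the compiler does this on a given platform, one can test the addresses it hands out. A minimal sketch, assuming a C11 compiler; `is_aligned` and `compiler_aligns_doubles` are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if pointer `p` is a multiple of `align`, which must be a power of two. */
bool is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Check a stack double and a static double against the alignment the
   compiler itself reports for the type. */
bool compiler_aligns_doubles(void)
{
    double local = 0.0;
    static double global;
    return is_aligned(&local, _Alignof(double)) &&
           is_aligned(&global, _Alignof(double));
}
```

Passing `8` instead of `_Alignof(double)` shows whether the stronger 8-byte alignment holds for these objects on a 32-bit target, where the ABI-required minimum may be only 4.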

Ira Baxter
  • What is the problem with alignment at a 4-byte boundary then? It would still need 2 cycles on a 32-bit system. – Raman Nov 23 '18 at 09:06
  • @Raman: you aren't considering the cost of reading the 2nd 32 bits from a location that causes a cache line to be fetched from main memory. Such fetches take tens of nanoseconds, in contrast to "1 cycle" taking 0.2 ns, so it's a lot more than just 1 cycle. This may be rare, but it's pretty expensive if it happens. – Ira Baxter Nov 23 '18 at 09:33
  • The x87 FPU in CPUs as old as P5 Pentium can load 64 bits at once from cache. That's why gcc chooses to give `double` 8-byte alignment even with `-m32`, except in structs where the i386 System V ABI may force it to be misaligned. [Double stack alignment question using gcc compiler for x86 architecture](//stackoverflow.com/q/58387218). All this talk of 32-bit CPUs not being able to fetch a whole double is nonsense in 2012; that's just the *integer* register width. That was true historically and the reason for the ABI design, though. – Peter Cordes Oct 15 '19 at 05:22

Aligning a value on a boundary smaller than its size makes it prone to being split across two cache lines. Splitting the value across two cache lines means extra work when evicting them to the backing store (two cache lines will be evicted instead of one), which puts a useless load on the memory buses.
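
The number of cache lines a single value occupies follows from its address. A small sketch, assuming 64-byte cache lines; `lines_touched` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of `line_size`-byte cache lines covered by an object of `size`
   bytes starting at `addr`. Writing the object dirties this many lines,
   all of which must eventually be written back. */
size_t lines_touched(uintptr_t addr, size_t size, size_t line_size)
{
    assert(size > 0 && line_size > 0);
    return (size_t)((addr + size - 1) / line_size - addr / line_size) + 1;
}
```

For example, an 8-byte double at offset 56 of a line stays within one line, while the same double at offset 60 occupies two.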

Benny

8-byte alignment for double on a 32-bit architecture doesn't reduce memory reads, but it still improves system performance by reducing cache accesses. Please read the following: https://stackoverflow.com/a/21220331/5038027

Community
Nitin

Refer to this wiki article about the double precision floating point format.

The number of memory cycles depends on your hardware architecture, which determines how many RAM banks you have. If you have a 32-bit architecture and 4 RAM banks, you need only 2 memory cycles to read a double (each RAM bank contributes 1 byte per cycle).

Manik Sidana
  • Don't understand the comment about needing only one memory cycle. First of all, "double precision" usually means 8-byte floating point numbers; secondly, a 32-bit architecture normally implies a 32-bit data bus. It's impossible to get 64 bits down a 32-bit pipe in one operation no matter how you organise the RAM. – JeremyP Jun 06 '12 at 13:03
  • There was a typing error. Rephrasing: a 32-bit machine with 4 RAM banks would access 8 bytes in 2 memory cycles. – Manik Sidana Jun 07 '12 at 07:55