
I have this assembly directive called .p2align that is being generated by gcc from the source of a C program.

As I understand it, aligned access is faster than unaligned access; also, an asm program doesn't automatically align memory locations or optimize memory access, so you have to do this yourself.

I can't really read this .p2align 4,,15, especially the last part, that 15.

Setting aside the fact that gcc apparently emits two commas instead of just one, as many docs show, what I get is that this piece of asm aligns memory in such a way that each location occupies 2^4 bits, i.e. 16 bits, so I think it's fair to say that a WORD is 16 bits long in this case.

Now, what could the 15 possibly mean? What is it a number of bits for? Does the counting start from 0, so that the "real" quantity is 16 instead of 15?

EDIT:

I just compiled the same C source to both 32-bit and 64-bit asm code; the memory is always aligned in exactly the same way, with the same directive .p2align 4,,15. Why is that?

Ori
user2485710

2 Answers


The .p2align directive is documented in the GNU assembler (gas) manual.

The first expression is the required byte alignment, expressed as a power of two. .p2align 4 pads to a 16-byte boundary, .p2align 5 to a 32-byte boundary, etc.

The second expression is the value to be used as padding. For x86, it's best to leave this out and let the assembler choose, since there is a range of instructions that are effective no-ops. In .p2align 4,,15 this field is simply left empty, which is why two commas appear in a row. In some alignment directives, you'll see 0x90, which is the NOP instruction.

The final expression is the maximum number of bytes of padding - if the alignment would require more than this, the directive is skipped. In this case - 4,,15 - the limit has no effect, since 15 is the most padding that can ever be needed to reach 16-byte alignment anyway.
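The arithmetic behind the first and third expressions can be sketched in C. This is only an illustration of the rule described above, not gas code: `p2align_padding` is a hypothetical helper name, and the addresses used with it are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of gas): number of padding bytes that
 * `.p2align log2_align,,max_skip` would emit when the location counter
 * is at `addr`. Returns 0 if addr is already aligned, or if the padding
 * needed exceeds max_skip (the directive is then skipped entirely). */
static uint64_t p2align_padding(uint64_t addr, unsigned log2_align,
                                uint64_t max_skip)
{
    assert(log2_align < 64);                      /* keep the shift defined */
    uint64_t align  = (uint64_t)1 << log2_align;  /* 4 -> 16 bytes */
    uint64_t needed = (align - (addr & (align - 1))) & (align - 1);
    return needed <= max_skip ? needed : 0;
}
```

Under these assumptions, `p2align_padding(0x1001, 4, 15)` gives 15, the worst case for 16-byte alignment, so a limit of 15 never causes a skip. With a smaller limit, `p2align_padding(0x1001, 4, 10)` gives 0 because the directive is skipped, while `p2align_padding(0x1006, 4, 10)` gives 10.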

Brett Hale
  • This is even weirder: so in both 32-bit and 64-bit mode I get an alignment based on 128-bit blocks of memory? Does this also mean that everything inside those 128 bits can be unaligned? By what logic do I pick the right alignment? I understand what this directive expresses now, but I don't get why it goes for 128 bits on a 32-bit or a 64-bit machine. – user2485710 Feb 04 '14 at 08:48
  • @user2485710 - forget bits. It's byte alignment that's important. While x86 doesn't *need* code to be aligned, it improves performance for loops, data access, etc. (A complex subject). Also, functions are expected to start on particular alignments for linker requirements. – Brett Hale Feb 04 '14 at 08:54
  • Where, and in what kind of documents, can I get more information? I really don't understand what is going on here ... – user2485710 Feb 04 '14 at 09:14
  • @user2485710 - AMD/Intel provide resources for developers. Agner Fog's [optimization manuals](http://www.agner.org/optimize/) are excellent. You can also compile simple functions and work through the assembly yourself. – Brett Hale Feb 04 '14 at 11:23
  • E.g. Linux kernel uses it as follows `#define __ALIGN .p2align 4, 0x90` – vmemmap Sep 17 '22 at 19:44

The p2 part of the directive name stands for "power of 2": the first argument is the base-2 logarithm of the alignment. The conditional form, with its maximum-padding argument, matches the recommendation for the Intel P-II CPU to provide conditional alignment of loop body code, and gas was possibly the original implementation of it. As Agner Fog explains, the original purpose was to ensure that the first instruction fetch gets sufficient code to begin decoding.

There is also an interaction with the Loop Stream Detector, which may fail to kick in if extra instruction cache line fragments are used at the top and bottom of the loop. Alignment is made conditional so as to avoid consuming more memory than necessary, and to avoid excessive execution time in the case where the padding bytes are executed. gcc makes different choices of alignment, depending on the -mtune target setting.

There have been targets where two alignment directives are emitted, for example to make unconditional 8-byte alignment and conditional 32-byte alignment. The reason for choosing various nop patterns is to minimize the time taken in the case where the padding stream is executed (when execution enters the loop from above). For example, a prefixed instruction which copies a register to itself can consume code bytes faster than single-byte nops. This makes no difference in the case originally alluded to in this thread. So, part of the confusion may come from this alignment directive having features which aren't relevant to setting data alignments, although the directive is also used for that purpose.
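The effect of pairing a conditional and an unconditional directive can be modelled with a small C sketch. The function names are hypothetical, and both the limit of 10 bytes and the order of the two directives are assumptions chosen for illustration; they are not taken from any particular gcc target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: location counter after
 * `.p2align log2_align,,max_skip` is applied at `addr`. If reaching the
 * boundary needs more than max_skip bytes, the directive is skipped and
 * addr is returned unchanged. */
static uint64_t apply_p2align(uint64_t addr, unsigned log2_align,
                              uint64_t max_skip)
{
    assert(log2_align < 64);                     /* keep the shift defined */
    uint64_t mask   = ((uint64_t)1 << log2_align) - 1;
    uint64_t needed = (0 - addr) & mask;         /* bytes to next boundary */
    return needed <= max_skip ? addr + needed : addr;
}

/* Models the pair (assumed order, illustrative limit of 10):
 *     .p2align 5,,10    conditional 32-byte alignment
 *     .p2align 3        unconditional 8-byte alignment
 */
static uint64_t align_loop_top(uint64_t addr)
{
    addr = apply_p2align(addr, 5, 10);
    return apply_p2align(addr, 3, UINT64_MAX);   /* no limit: always aligns */
}
```

In this model, `align_loop_top(0x1017)` yields 0x1020 because only 9 bytes are needed to reach the 32-byte boundary, whereas `align_loop_top(0x1001)` yields 0x1008: the 31 bytes needed exceed the limit, so only the 8-byte fallback applies and 7 bytes of padding are spent instead of 31.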

Michael
tim18
  • I have a case today with .p2align 4,,10 where the padding limit would need to be increased by 1 to avoid a 30% performance reduction. I'm wondering what options gcc has to change the padding limit. – tim18 Sep 07 '15 at 14:49