How can I prevent functions from being aligned to a 16-byte boundary when compiling for x86?

Question

I'm working in an embedded-like environment where each byte is extremely precious, much more so than additional cycles for unaligned accesses. I have some simple Rust code from an OS development example:

#![feature(lang_items)]
#![no_std]
extern crate rlibc;
#[no_mangle]
pub extern fn rust_main() {

    // ATTENTION: we have a very small stack and no guard page

    let hello = b"Hello World!";
    let color_byte = 0x1f; // white foreground, blue background

    let mut hello_colored = [color_byte; 24];
    for (i, char_byte) in hello.into_iter().enumerate() {
        hello_colored[i*2] = *char_byte;
    }

    // write `Hello World!` to the center of the VGA text buffer
    let buffer_ptr = (0xb8000 + 1988) as *mut _;
    unsafe { *buffer_ptr = hello_colored };

    loop{}

}

#[lang = "eh_personality"] extern fn eh_personality() {}
#[lang = "panic_fmt"] #[no_mangle] pub extern fn panic_fmt() -> ! {loop{}}

I also use this linker script:

OUTPUT_FORMAT("binary")
ENTRY(rust_main)
phys = 0x0000;
SECTIONS
{
  .text phys : AT(phys) {
    code = .;
    *(.text.start);
    *(.text*)
    *(.rodata)
    . = ALIGN(4);
  }
  __text_end=.;
  .data : AT(phys + (data - code))
  {
    data = .;
    *(.data)
    . = ALIGN(4);
  }
  __data_end=.;
  .bss : AT(phys + (bss - code))
  {
    bss = .;
    *(.bss)
    . = ALIGN(4);
  }
  __binary_end = .;
}

I optimize it with opt-level: 3 and LTO using an i586 targeted compiler and the GNU ld linker, including -O3 in the linker command. I've also tried opt-level: z and a coupled -Os at the linker, but this resulted in code that was bigger (it didn't unroll the loop). As it stands, the size seems pretty reasonable with opt-level: 3.

There are quite a few bytes that seem wasted on aligning functions to some boundary. After the unrolled loop, 8 nop instructions are inserted and then there is an infinite loop as expected. After this, there appears to be another infinite loop that is preceded by 7 nop instructions with a 16-bit operand-size override (i.e., xchg ax,ax rather than xchg eax,eax). This adds up to about 24 bytes wasted in a 196-byte flat binary.

What exactly is the optimizer doing here?
What options do I have to disable it?
Why is unreachable code being included in the binary?

The full assembly listing below:

   0:   c6 05 c4 87 0b 00 48    movb   $0x48,0xb87c4
   7:   c6 05 c5 87 0b 00 1f    movb   $0x1f,0xb87c5
   e:   c6 05 c6 87 0b 00 65    movb   $0x65,0xb87c6
  15:   c6 05 c7 87 0b 00 1f    movb   $0x1f,0xb87c7
  1c:   c6 05 c8 87 0b 00 6c    movb   $0x6c,0xb87c8
  23:   c6 05 c9 87 0b 00 1f    movb   $0x1f,0xb87c9
  2a:   c6 05 ca 87 0b 00 6c    movb   $0x6c,0xb87ca
  31:   c6 05 cb 87 0b 00 1f    movb   $0x1f,0xb87cb
  38:   c6 05 cc 87 0b 00 6f    movb   $0x6f,0xb87cc
  3f:   c6 05 cd 87 0b 00 1f    movb   $0x1f,0xb87cd
  46:   c6 05 ce 87 0b 00 20    movb   $0x20,0xb87ce
  4d:   c6 05 cf 87 0b 00 1f    movb   $0x1f,0xb87cf
  54:   c6 05 d0 87 0b 00 57    movb   $0x57,0xb87d0
  5b:   c6 05 d1 87 0b 00 1f    movb   $0x1f,0xb87d1
  62:   c6 05 d2 87 0b 00 6f    movb   $0x6f,0xb87d2
  69:   c6 05 d3 87 0b 00 1f    movb   $0x1f,0xb87d3
  70:   c6 05 d4 87 0b 00 72    movb   $0x72,0xb87d4
  77:   c6 05 d5 87 0b 00 1f    movb   $0x1f,0xb87d5
  7e:   c6 05 d6 87 0b 00 6c    movb   $0x6c,0xb87d6
  85:   c6 05 d7 87 0b 00 1f    movb   $0x1f,0xb87d7
  8c:   c6 05 d8 87 0b 00 64    movb   $0x64,0xb87d8
  93:   c6 05 d9 87 0b 00 1f    movb   $0x1f,0xb87d9
  9a:   c6 05 da 87 0b 00 21    movb   $0x21,0xb87da
  a1:   c6 05 db 87 0b 00 1f    movb   $0x1f,0xb87db
  a8:   90                      nop
  a9:   90                      nop
  aa:   90                      nop
  ab:   90                      nop
  ac:   90                      nop
  ad:   90                      nop
  ae:   90                      nop
  af:   90                      nop
  b0:   eb fe                   jmp    0xb0
  b2:   66 90                   xchg   %ax,%ax
  b4:   66 90                   xchg   %ax,%ax
  b6:   66 90                   xchg   %ax,%ax
  b8:   66 90                   xchg   %ax,%ax
  ba:   66 90                   xchg   %ax,%ax
  bc:   66 90                   xchg   %ax,%ax
  be:   66 90                   xchg   %ax,%ax
  c0:   eb fe                   jmp    0xc0
  c2:   66 90                   xchg   %ax,%ax
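
As a rough check on the numbers above, the padding can be tallied from the offsets in the listing. This is only a sketch: the helper `padding` and the variable names are hypothetical, and attributing the final 2 bytes to `. = ALIGN(4)` in the linker script is an assumption rather than something the listing proves.

```rust
/// Bytes of padding needed to advance `offset` to the next multiple of `align`.
/// (Hypothetical helper, not part of the original code.)
fn padding(offset: usize, align: usize) -> usize {
    (align - offset % align) % align
}

fn main() {
    // End offsets read off the disassembly listing.
    let after_stores = 0xa8; // end of the unrolled movb sequence
    let after_first_jmp = 0xb2; // end of the first `jmp`
    let after_second_jmp = 0xc2; // end of the second `jmp`

    let pad_block = padding(after_stores, 16); // nops before the first loop
    let pad_function = padding(after_first_jmp, 16); // xchg %ax,%ax run
    let pad_section = padding(after_second_jmp, 4); // assumed: linker ALIGN(4)

    let total = pad_block + pad_function + pad_section;
    // prints "8 + 14 + 2 = 24 bytes of padding"
    println!(
        "{} + {} + {} = {} bytes of padding",
        pad_block, pad_function, pad_section, total
    );
}
```

Under these assumptions, all but 2 bytes of the padding come from the two 16-byte alignments.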

I don't know Rust, but the second infinite loop in the disassembly may be the second infinite loop in your source code at the end. Giving loop branch targets 16 byte alignment is a very common performance optimization, although obviously the performance of an infinite loop likely isn't going to matter. — Ross Ridge, Jul 17 '17 at 04:53
Try to add `-C llvm-args=-align-all-blocks=1` to `rustc` options. — red75prime, Jul 17 '17 at 07:18
Code for `pub extern panic_fmt()` is probably included in the binary because you declared it as an exported public function, or because you [didn't declare `panic_fmt` correctly](https://doc.rust-lang.org/core/#how-to-use-the-core-library). I cannot build your code at the moment, so I can't verify this. – red75prime, Jul 17 '17 at 08:12
Are you sure you are not sweating the small stuff? 26 bytes here may be 13% of the entire footprint, but that is unlikely to scale for non-trivial applications - that is it will be much less than 13%. What is "embedded-like"? Not all embedded systems are resource constrained; if targeting i586 (with typically large SDRAM) is the byte alignment really going to be a significant issue in a non-trivial example? — Clifford, Jul 17 '17 at 09:29
@Clifford I'd even say that the question should have been at least three — "why is this alignment here", "how do I remove the alignment", "why is this other code included". I'd have expected a bit better from a 25K+ rep user :-(. — Shepmaster, Jul 17 '17 at 17:45
@Clifford it's a lot easier to determine where space goes at this kind of level in a small program than in a big program. Also I agree with the title change, and high rep users are mostly just people who've used it since the early days. I stopped daily browsing of SO when nitpicking over format and rules was more common than simply answering questions, even if they bend the rules a bit. Some of the best questions on SO are not ones you could ask today with its current user base. – Earlz, Jul 17 '17 at 23:35

score 6 · Accepted Answer · edited Jul 17 '17 at 12:18

As Ross states, aligning functions and branch targets to 16 bytes is a common x86 optimization recommended by Intel, although it can occasionally be less efficient, such as in your case. Deciding optimally whether to align is a hard problem for a compiler, and I believe LLVM simply opts to always align. For more information, see Performance optimisations of x86-64 assembly - Alignment and branch prediction.

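As red75prime's comment hints (but doesn't explain), LLVM uses the value of `align-all-blocks` to set the alignment of branch targets. The value appears to be a power-of-two exponent rather than a byte count, so setting it to 1 reduces the alignment to 2 bytes, which removes nearly all of the padding. Note that this applies globally, and that benchmarking before and after the change is recommended.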

answered Jul 17 '17 at 07:40 by bug · edited Jul 17 '17 at 12:18 by Shepmaster

Coming back to this a year later... `align-all-functions=1` will actually align all functions on a 2-byte boundary. Meanwhile, `align-all-functions=0` will use platform defaults (only align certain functions, but align those to a 16- or 32-byte boundary). For my use case, size is significantly more important than performance. – Earlz Sep 18 '19 at 23:04
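
To make the effect of the exponent concrete, here is a small sketch that lays out the three pieces of code from the question's listing (168 bytes of stores, then two 2-byte `jmp` loops) under different alignment settings. It assumes the option value is a log2 exponent, as the comment above describes, and for simplicity applies the same alignment to blocks and functions. The names `round_up` and `layout_size` are hypothetical, and the final `ALIGN(4)` mirrors the question's linker script.

```rust
/// Round `offset` up to the next multiple of `align` (hypothetical helper).
fn round_up(offset: usize, align: usize) -> usize {
    (offset + align - 1) / align * align
}

/// Total size when each chunk starts on a 2^log2_align boundary and the
/// end of the section is rounded up to `section_align` bytes.
fn layout_size(chunk_sizes: &[usize], log2_align: u32, section_align: usize) -> usize {
    let align = 1usize << log2_align;
    let mut offset = 0;
    for &size in chunk_sizes {
        offset = round_up(offset, align) + size;
    }
    round_up(offset, section_align)
}

fn main() {
    // Chunk sizes taken from the listing: stores, first jmp, second jmp.
    let chunks = [168, 2, 2];

    let with_16 = layout_size(&chunks, 4, 4); // exponent 4 -> 16-byte alignment
    let with_2 = layout_size(&chunks, 1, 4); // exponent 1 -> 2-byte alignment

    // prints "16-byte: 196, 2-byte: 172, saved: 24"
    println!("16-byte: {}, 2-byte: {}, saved: {}", with_16, with_2, with_16 - with_2);
}
```

The 16-byte case reproduces the 196-byte binary from the question. The 172-byte figure is only what this simplified model predicts, not a measured result.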
