Does GCC optimize assembly source files?

Question

I can use GCC to convert assembly source files into relocatable object files.

gcc -c source.S -o object.o -O2

Is the optimization option effective? Can I expect GCC to optimize my assembly code?

What happened when you tried? When you examined the input and output? – old_timer, Nov 04 '20 at 00:00
Yes, I did notice some small changes between the source and the disassembly, but I'm not sure whether they are for optimization purposes. – willswordpath, Nov 04 '20 at 00:56
For example, the source `lgdt gdtdesc \n movl %cr0,%eax \n orl $0x1,%eax` was translated into five instructions: `lgdtl (%esi) \n insb (%dx),%es:(%edi) \n jl \n and %al,%al \n or $0x1,%ax` – willswordpath, Nov 04 '20 at 01:05
@old_timer After a moment I thought it could be a disassembler display issue, and that both versions have the same binary encoding. The disassembled instructions' bytes: 0f 01 16 6c 7c 0f 20 c0 66 83 c8 01 – willswordpath, Nov 04 '20 at 01:11
And when you used as instead of gcc, did you see something different? – old_timer, Nov 04 '20 at 01:11
No, gcc doesn't optimize asm. ld might, but you have to prep the objects right. – old_timer, Nov 04 '20 at 01:15
Now I'm pretty sure it's a disassembler display issue. I used -O0 and it gave me the same result. – willswordpath, Nov 04 '20 at 01:34
Not gcc per se, but there is no guarantee that a next-generation link-time optimiser (or, more correctly, an inter-module optimiser) mightn't have a peek at your code and fix it for you. – mevets, Nov 04 '20 at 02:41

score 5 · Accepted Answer · answered Nov 03 '20 at 21:00

No.

GCC passes your assembly source through the preprocessor and then to the assembler. At no time are any optimisations performed.

fuz

How about link-time optimizations? (not in OP's case, but in general) – Eugene Sh. Nov 03 '20 at 21:02
@EugeneSh. What do you mean by that? – Konrad Rudolph Nov 03 '20 at 21:03
@EugeneSh.: I would expect most, if not all, link-time optimizations to rely on information passed by the compiler (through special data structures in the object files or in files associated with them, similar to debugging information) or particular structures or settings in or with the assembly code that indicate certain optimizations, like identifying potentially separable subsections by global symbols, are available. – Eric Postpischil Nov 03 '20 at 21:05
@EricPostpischil That makes sense. – Eugene Sh. Nov 03 '20 at 21:06
I have some very vague inkling of a recollection of an assembly language front end to GCC or Clang, so that it would read assembly like any other programming language, build its usual internal semantic tree and whatnot, optimize it, and generate new assembly. But I do not recall any details, not even whether that was an idea or somebody was actually doing something about it or it was already done somewhere. (Hmm, one could identify the inputs to a routine by which registers were used without evident initialization. But outputs could not be properly determined automatically.) – Eric Postpischil Nov 03 '20 at 21:08
That's disappointing. I thought there were still a lot of optimizations that could be performed at the assembling stage, like instruction-level optimizations, call inlining, etc. – willswordpath Nov 03 '20 at 21:17
@willswordpath Actually it sounds like a good thing. You can write things that you don't want to be optimized in assembly. I mean what is the other reason to write in assembly? – Eugene Sh. Nov 03 '20 at 21:19
@EugeneSh. They could always add a volatile option, something like -fasm-volatile, I guess? Someone must already have worked on assembly-stage optimization somewhere, since this is a huge aspect of the software industry. – willswordpath Nov 03 '20 at 21:32
@willswordpath: Most code is not written by hand in asm. If you want optimized asm, optimize it yourself in ways that automatic tools aren't smart enough for, or let compilers generate it from higher-level source code. But there are binary-to-binary optimizers that could I guess be useful if you had code produced by a bad old compiler that didn't know about modern CPUs. (machine code and asm are easy to translate between). See [Are there any ASM compilers?](https://stackoverflow.com/q/4394609). I have heard of binary to binary optimizers. But that's not what most people want GAS to do. – Peter Cordes Nov 04 '20 at 00:28
@willswordpath Historically, there have been toolchains that would optimise assembly code. One of them is the Plan 9 toolchain which survives in rudiments as a part of the Go toolchain. The assembly-optimising parts have been cut out though; it's not a fashionable approach anymore. This is because it's a lot easier to optimise the code before it is turned into assembly and very few people write assembly anyway. – fuz Nov 04 '20 at 10:17

Peter Cordes · Answer 2 · 2022-05-31T05:37:28.913

If you don't want to hand-optimize your asm, assembly language is the wrong choice of source language for you. Perhaps consider LLVM-IR if you want something asm-like but which is actually input for an optimizing compiler. (And ISA-independent.)

To be fair, there are some binary-to-binary recompilers / optimizers that try to figure out what's implementation detail and what's important logic, and optimize accordingly. (Reading from asm source instead of machine code would also be possible; asm and machine code are easy to convert back and forth and have a nearly 1:1 mapping). But that's not what assemblers do.

An assembler's job is normally just to faithfully translate what you write into machine code. Having a tool to do that is necessary for experimenting to find out what actually is faster, without the annoyance of writing actual machine code by hand.

Interestingly, GAS (the GNU assembler) does have some limited optimization options for x86 that aren't enabled by the GCC front-end, even if you run gcc -O2. (You can run gcc -v ... to see how the front-end invokes other programs to do the real work, and with what options.)

Use gcc -Wa,-Os -O3 foo.c bar.S to enable full optimization of your C, and GAS's minor peephole optimizations for your asm. (Or use -Wa,-O2; unfortunately the manual is wrong, and -Os misses some of the optimizations from -O2.) -Wa,... passes ... on the as command line, just like -Wl,... passes linker options through the GCC front-end.

GCC doesn't normally enable as's optimizations because it normally feeds GAS already-optimized asm.

GAS's optimizations are only for single instructions in isolation, and thus only when an instruction can be replaced by another that has exactly the same architectural effect (except for length, so the effect on RIP differs). The micro-architectural effect (performance) can also be different; that's the point of the non-size optimizations.

From the as(1) man page, so note that these are as options, not gcc options.

-O0 | -O | -O1 | -O2 | -Os

Optimize instruction encoding with smaller instruction size. -O and -O1 encode 64-bit register load instructions with 64-bit immediate as 32-bit register load instructions with 31-bit or 32-bits immediates, encode 64-bit register clearing instructions with 32-bit register clearing instructions, encode 256-bit/512-bit VEX/EVEX vector register clearing instructions with 128-bit VEX vector register clearing instructions, encode 128-bit/256-bit EVEX vector register load/store instructions with VEX vector register load/store instructions, and encode 128-bit/256-bit EVEX packed integer logical instructions with 128-bit/256-bit VEX packed integer logical.

-O2 includes -O1 optimization plus encodes 256-bit/512-bit EVEX vector register clearing instructions with 128-bit EVEX vector register clearing instructions. In 64-bit mode VEX encoded instructions with commutative source operands will also have their source operands swapped if this allows using the 2-byte VEX prefix form instead of the 3-byte one. Certain forms of AND as well as OR with the same (register) operand specified twice will also be changed to TEST.

-Os includes -O2 optimization plus encodes 16-bit, 32-bit and 64-bit register tests with immediate as 8-bit register test with immediate. -O0 turns off this optimization.

(re: some of those VEX / EVEX operand-size and code-size optimizations, see "Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?" and, for 2 vs. 3-byte VEX prefixes, the section near the end of my answer on "How to tell the length of an x86 instruction?")
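The operand swap described under -O2 can be modelled in a few lines. This is only a sketch of the size rule, not GAS's actual code: the function names are made up for illustration, and it assumes a register-to-register instruction in the 0F opcode map with W=0 (such as vpxor or vxorps), ignoring memory operands. Registers are given by number (ymm15 is 15).

```python
def vex_prefix_bytes(rm_reg):
    """VEX prefix length for a reg-reg instruction in the 0F map with W=0.

    The 2-byte form (C5) has no B bit, so a ModRM.rm register numbered
    8-15 forces the 3-byte form (C4).
    """
    return 3 if rm_reg >= 8 else 2


def pick_operand_order(dst, src1, src2, commutative=True):
    """Return (src1, src2, prefix_bytes) after an optional operand swap.

    dst  goes in ModRM.reg; its extension bit R exists in both VEX forms,
         so it never affects the prefix length here.
    src1 goes in VEX.vvvv, which reaches all 16 registers in both forms.
    src2 goes in ModRM.rm, which needs the B bit for registers 8-15.
    """
    if commutative and src2 >= 8 and src1 < 8:
        # Move the high register into vvvv so that rm holds a low register.
        src1, src2 = src2, src1
    return src1, src2, vex_prefix_bytes(src2)


# vpxor ymm3, ymm4, ymm15: swapping the sources allows the 2-byte prefix.
print(pick_operand_order(3, 4, 15))
# vpxor ymm3, ymm14, ymm15: both sources are high, so swapping cannot help.
print(pick_operand_order(3, 14, 15))
```

Under these assumptions the model agrees with the listing further down: the ymm4/ymm15 case gets swapped and uses a C5 prefix, while the ymm14/ymm15 case stays at C4.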

Unfortunately -O2 and -Os conflict, and -Os doesn't actually include everything from -O2. You can't get GAS to both optimize test [re]dx, 1 to test dl,1 (-Os) and optimize or al,al to test al,al (-O2) in the same run.

But it's still more optimization than NASM does. (NASM's optimization is on by default, except in ancient versions; GAS's is off by default except for picking the shortest encoding without changing the mnemonic or operand names.)

test r/m32, imm8 is not encodable, so the edx version needs an imm32.
or al,al is an obsolete 8080 idiom that's not useful on x86, except sometimes on P6-family CPUs to avoid register-read stalls, where intentionally rewriting the register is actually better than avoiding lengthening the dependency chain.
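Why or al,al can be replaced by test al,al, while cmp dl,0 cannot quite be replaced by test dl,dl, can be checked with a small flag model. This is a hedged sketch: the helper names are invented for illustration, only 8-bit operands are modelled, and AF is left out of the logic-instruction results because the architecture leaves it undefined for OR, AND and TEST.

```python
def parity_even(byte):
    """PF is set when the low byte has an even number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0


def logic_flags(result):
    """Flags after an 8-bit OR/AND/TEST: CF and OF cleared, AF undefined (omitted)."""
    result &= 0xFF
    return {
        "ZF": result == 0,
        "SF": bool(result & 0x80),
        "PF": parity_even(result),
        "CF": False,
        "OF": False,
    }


def flags_or(a, b):
    return logic_flags(a | b)


def flags_test(a, b):
    return logic_flags(a & b)


def flags_cmp(a, b):
    """Flags after 8-bit CMP a, b (computes a - b); AF is defined here."""
    a &= 0xFF
    b &= 0xFF
    result = (a - b) & 0xFF
    flags = logic_flags(result)
    flags["CF"] = a < b                                # unsigned borrow
    flags["OF"] = bool((a ^ b) & (a ^ result) & 0x80)  # signed overflow
    flags["AF"] = (a & 0xF) < (b & 0xF)                # borrow out of bit 3
    return flags


# or x,x and test x,x agree on every defined flag for all byte values,
# and neither changes the register value (x | x == x).
print(all(flags_or(x, x) == flags_test(x, x) for x in range(256)))

# cmp x,0 agrees with test x,x on everything except AF,
# which CMP defines and TEST leaves undefined.
print(all(
    {k: v for k, v in flags_cmp(x, 0).items() if k != "AF"} == flags_test(x, x)
    for x in range(256)
))
```

The only difference the model finds between cmp x,0 and test x,x is the AF entry, which is consistent with the remark in the listing below that this rewrite is "not quite legal".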

.intel_syntax noprefix

shufps xmm0, xmm0, 0
vxorps zmm31, zmm31, zmm31
vxorps zmm1, zmm1, zmm1
vxorps ymm15, ymm15, ymm15
vpxord zmm15, zmm15, zmm15

vpxord ymm3, ymm14, ymm15
vpxord ymm3, ymm4, ymm15
vmovd  xmm16, [rdi + 256]    # can use EVEX scaled disp8
vmovd  xmm0, [rdi + 256]     # could use EVEX scaled disp8 but doesn't even with a -march enabling AVX512

xor  rax, rax
or  al,al
cmp dl, 0
test rdx, 1

mov  rax, 1
mov  rax, -1
mov  rax, 0xffffffff80000000

.att_syntax
movabs $-1, %rax
movq   $1, %rax
movabs $1, %rax

Assembled with gcc -g -Wa,-msse2avx -Wa,-O2 -Wa,-march=znver2+avx512dq+avx512vl -c foo.s. (For some insane reason, as has -march= support for modern AMD CPU names, but for Intel only up to corei7 and some Xeon Phi, not Skylake-avx512 like GCC does. So I had to enable AVX512 manually to test that.)

objdump -dwrC -Mintel -S output (source + disassembly):

0000000000000000 <.text>:
.intel_syntax noprefix

shufps xmm0, xmm0, 0                  # -msse2avx just for fun
   0: c5 f8 c6 c0 00        vshufps xmm0,xmm0,xmm0,0x0
vxorps zmm31, zmm31, zmm31            # avoids triggering AVX512 frequency limit
   5: 62 01 04 00 57 ff     vxorps xmm31,xmm31,xmm31
vxorps zmm1, zmm1, zmm1               # shorter, using VEX
   b: c5 f0 57 c9           vxorps xmm1,xmm1,xmm1
vxorps ymm15, ymm15, ymm15            # missed optimization, could vxorps xmm15, xmm0, xmm0 for a 2-byte VEX and still be a zeroing idiom
   f: c4 41 00 57 ff        vxorps xmm15,xmm15,xmm15
vpxord zmm15, zmm15, zmm15            # AVX512 mnemonic optimized to AVX1, same missed opt for source operands.
  14: c4 41 01 ef ff        vpxor  xmm15,xmm15,xmm15

vpxord ymm3, ymm14, ymm15             # no optimization possible
  19: c4 c1 0d ef df        vpxor  ymm3,ymm14,ymm15
vpxord ymm3, ymm4, ymm15              # reversed operands to allow 2-byte VEX
  1e: c5 85 ef dc           vpxor  ymm3,ymm15,ymm4
vmovd  xmm16, [rdi + 256]         # uses EVEX scaled disp8 because xmm16 requires EVEX anyway
  22: 62 e1 7d 08 6e 47 40  vmovd  xmm16,DWORD PTR [rdi+0x100]
vmovd  xmm0, [rdi + 256]          # could use EVEX scaled disp8 but doesn't even with a -march enabling AVX512
  29: c5 f9 6e 87 00 01 00 00  vmovd  xmm0,DWORD PTR [rdi+0x100]

xor  rax, rax                     # dropped REX prefix
  31: 31 c0                 xor    eax,eax
or  al,al
  33: 84 c0                 test   al,al
cmp dl, 0                         # optimization to test dl,dl not quite legal: different effect on AF
  35: 80 fa 00              cmp    dl,0x0
test rdx, 1                       # partial optimization: only to 32-bit, not 8-bit
  38: f7 c2 01 00 00 00     test   edx,0x1

mov  rax, 1
  3e: b8 01 00 00 00        mov    eax,0x1
mov  rax, -1                         # sign-extension required
  43: 48 c7 c0 ff ff ff ff  mov    rax,0xffffffffffffffff
mov  rax, 0xffffffff80000000
  4a: 48 c7 c0 00 00 00 80  mov    rax,0xffffffff80000000

.att_syntax
movabs $-1, %rax                    # movabs forces imm64, despite -O2
  51: 48 b8 ff ff ff ff ff ff ff ff    movabs rax,0xffffffffffffffff
movq   $1, %rax                     # but explicit q operand size doesn't stop opt
  5b: b8 01 00 00 00        mov    eax,0x1
movabs $1, %rax
  60: 48 b8 01 00 00 00 00 00 00 00    movabs rax,0x1

So unfortunately even explicitly enabling AVX512VL and AVX512DQ didn't get GAS to choose a shorter EVEX encoding for vmovd when an EVEX wasn't already necessary. That's perhaps still intentional: you might want some functions to use AVX512, some to avoid it. If you're using ISA-option limits to catch accidental use of ISA extensions, you would have to enable AVX512 for the whole of such a file. It might be surprising to find the assembler using EVEX where you weren't expecting.

You can manually force it with {evex} vmovd xmm0, [rdi + 256]. (Which unfortunately GCC doesn't do when compiling C, where -march=skylake-avx512 definitely does give it free rein to use AVX512 instructions everywhere.)
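The mov rax, imm rows in the listing above follow a simple size rule, which can be sketched as below. This is a model of the shortest-encoding choice, not GAS's implementation, and encode_mov_rax is a hypothetical helper name. It assumes the destination is RAX and that optimization is allowed, so an explicit movabs in the source (which GAS keeps as imm64) is not modelled.

```python
import struct


def encode_mov_rax(value):
    """Return the shortest x86-64 encoding that loads a 64-bit constant into RAX."""
    value &= 0xFFFFFFFFFFFFFFFF  # view the input as a 64-bit bit pattern
    if value <= 0xFFFFFFFF:
        # mov eax, imm32: writing EAX zero-extends into RAX (5 bytes).
        return b"\xb8" + struct.pack("<I", value)
    if value >= 0xFFFFFFFF80000000:
        # mov rax, simm32: REX.W C7 /0 with a sign-extended imm32 (7 bytes).
        return b"\x48\xc7\xc0" + struct.pack("<I", value & 0xFFFFFFFF)
    # mov rax, imm64 (movabs): REX.W B8 with a full 64-bit immediate (10 bytes).
    return b"\x48\xb8" + struct.pack("<Q", value)


for constant in (1, -1, 0xFFFFFFFF80000000, 0x123456789):
    print(hex(constant), encode_mov_rax(constant).hex(" "))
```

For 1, -1 and 0xffffffff80000000 the bytes match the disassembly above (b8 ..., 48 c7 c0 ..., 48 c7 c0 ...). The last constant is an extra example that fits neither 32-bit form and therefore needs the 10-byte encoding.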

score 2 · Answer 3 · edited Nov 21 '20 at 21:39

so.s

#define HELLO 0x5
mov $HELLO, %eax
mov $0x5,%eax
mov $0x5,%eax
mov $0x5,%eax
retq

gcc -O2 -c so.s -o so.o
objdump -d so.o

0000000000000000 <.text>:
   0:   b8 00 00 00 00          mov    $0x0,%eax
   5:   b8 05 00 00 00          mov    $0x5,%eax
   a:   b8 05 00 00 00          mov    $0x5,%eax
   f:   b8 05 00 00 00          mov    $0x5,%eax
  14:   c3                      retq

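It didn't even pre-process the define. With the lowercase .s extension GCC skips the C preprocessor, so the assembler sees the #define line as a comment and treats HELLO as an undefined symbol, leaving the immediate as 0 for the linker to fill in.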

rename so.s to so.S

gcc -O2 -c so.S -o so.o
objdump -d so.o

0000000000000000 <.text>:
   0:   b8 05 00 00 00          mov    $0x5,%eax
   5:   b8 05 00 00 00          mov    $0x5,%eax
   a:   b8 05 00 00 00          mov    $0x5,%eax
   f:   b8 05 00 00 00          mov    $0x5,%eax
  14:   c3                      retq

It pre-processes the define but no optimization is occurring.

Looking slightly deeper at what is being passed to as:

gcc -O2 -c -save-temps so.s -o so.o
[0][as]
[1][--64]
[2][-o]
[3][so.o]
[4][so.s]

cat so.s

#define HELLO 0x5
mov $HELLO, %eax
mov $0x5,%eax
mov $0x5,%eax
mov $0x5,%eax
retq

And

gcc -O2 -c -save-temps so.S -o so.o
[0][as]
[1][--64]
[2][-o]
[3][so.o]
[4][so.s]

cat so.s
# 1 "so.S"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "so.S"


mov $0x5, %eax
mov $0x5,%eax
mov $0x5,%eax
mov $0x5,%eax
retq

Still no optimization.

That should be more than enough to demonstrate. There are link-time optimizations that you can do, but you have to build the objects right and then tell the linker. I suspect, though, that it doesn't work at the machine-code level but at a high level, and re-generates code.

so.c

int main ( void )
{
    return(5);
}

gcc -O2 so.c -save-temps -o so.o
cat so.s

    .file   "so.c"
    .section    .text.unlikely,"ax",@progbits
.LCOLDB0:
    .section    .text.startup,"ax",@progbits
.LHOTB0:
    .p2align 4,,15
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    movl    $5, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .section    .text.unlikely
.LCOLDE0:
    .section    .text.startup
.LHOTE0:
    .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609"
    .section    .note.GNU-stack,"",@progbits

Using the so.S from above

gcc -flto -O2 so.S -save-temps -o so.o
cat so.s

# 1 "so.S"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "so.S"


mov $0x5, %eax
mov $0x5,%eax
mov $0x5,%eax
mov $0x5,%eax
retq

Using the so.c from above

gcc -flto -O2 so.c -save-temps -o so.o
cat so.s

.file   "so.c"
.section    .gnu.lto_.profile.3f5dbe2a70110b8,"e",@progbits
.string "x\234ca`d`a`"
.string "\222L\214"
.string ""
.string "o"
.ascii  "\016"
.text
.section    .gnu.lto_.icf.3f5dbe2a70110b8,"e",@progbits
.string "x\234ca`d"
.string "\001\016\006\004`d\330|\356\347Nv\006"
.ascii  "\017\243\003I"
.text
.section    .gnu.lto_.jmpfuncs.3f5dbe2a70110b8,"e",@progbits
.string "x\234ca`d"
.string "\001V\006\004"
.string "\213"
.string ""
.string ""
.string "\356"
.ascii  "\f"
.text
.section    .gnu.lto_.inline.3f5dbe2a70110b8,"e",@progbits
.string "x\234ca`d"
.string "\001\021\006\004"
.string "\21203120\001\231l\013\344\231\300b"
.string "\n\031"
.ascii  "\352"
.text
.section    .gnu.lto_.pureconst.3f5dbe2a70110b8,"e",@progbits
.string "x\234ca`d`f`"
.string "\222\f"
.string ""
.string "X"
.ascii  "\n"
.text
.section    .gnu.lto_main.3f5dbe2a70110b8,"e",@progbits
.ascii  "x\234\035\216\273\016\001a\020\205\347\314\277\313\026\210\236"
.ascii  "B\253\3610^\301\003(<\300\376\330B\024\262\005\211\210r\223-"
.ascii  "\334[\3256\n\005\2117\020\n\211NH(\0043&9\2319\231o.\016\201"
.ascii  "4f\242\264\250 \202!p\270'jz\fha=\220\317\360\361bkp\b\226c\363"
.ascii  "\344\216`\216\330\333nt\316\251\005Jb/Qo\210rl%\216\233\276\327"
.ascii  "\r\3211L-\201\247(b\202\242^\230\241L\302\236V\237A6\025([RD"
.ascii  ":s\244\364\243E5\261\337o\333&q\336e\242\273H\037y0k6W\264\362"
.ascii  "\272`\033\255\337\031\275\315p\261\370\357\026\026\312\310\204"
.ascii  "\333\250Wj\364\003\t\210<\r"
.text
.section    .gnu.lto_.symbol_nodes.3f5dbe2a70110b8,"e",@progbits
.string "x\234ca`d\020f"
.string "\002&\206z\006\206\t\347\030@\324\256\206@\240\b"
.ascii  "'\370\004\002"
.text
.section    .gnu.lto_.refs.3f5dbe2a70110b8,"e",@progbits
.string "x\234ca`\004B "
.string ""
.string ""
.string "9"
.ascii  "\007"
.text
.section    .gnu.lto_.decls.3f5dbe2a70110b8,"e",@progbits
.string "x\234\205PMK\002Q\024\275\347\315h\222\021R-\\\270\020\027\355\222\244\020\367A\355b6A\264\013\261p\221AmZ^\377\200DB\340N\004)\320j~A\bA\021\371\007J!\241e\277@\b\354\276y3\216\320\242\013\367\343\335w\3369\367]\233@\332\372\222V%\357\213O\304\224\344\003\nM\243\\\372k\272g\211/\211\257\210;\377\340\331\302w{\370\025\031\340\035\242\201D\202\022\004xC\350\344\225\306\275\243\024\312\213\024\266\020"
.ascii  "\375\263\nJ_\332\300u\317\344I`\001\211O\345\253i\006\302tB\363"
.ascii  "\b\360X\303\247Se\005\337h\226\330\260\316\360\032q\177\023A"
.ascii  "\224\337\337<\266\027\207\370\2502s\223\331\301T\322[#Q\224\331"
.ascii  "\326\373\204\2058\321\302S\203\235+\301\266\270\247\367%\004"
.ascii  "\215\376[\335\262\226\241\353\317\361\355v\266+\327|\311\254"
.ascii  "\n\341\216;?\265\227x\362Z\337\214\252\234\006\234yl\244\260"
.ascii  "\236\022\261\007$%\036\331\0069~\346V4\323d\327\345Q\375U\325"
.ascii  "\270\247GS\032\205;\031\342\036Y=\241\224\022\273\030\002\035"
.ascii  "\fd`\027\031\232\273(\344\327\362\233\024;.UJg\345\"\331'\207"
.ascii  "\345Jlgw/\275\225\313Q\344\3744[\244_\320\267k~"
.text
.section    .gnu.lto_.symtab.3f5dbe2a70110b8,"e",@progbits
.string "main"
.string ""
.string ""
.string ""
.string ""
.string ""
.string ""
.string ""
.string ""
.string ""
.string ""
.string ""
.string "\260"
.string ""
.string ""
.text
.section    .gnu.lto_.opts,"e",@progbits
.string "'-fmath-errno' '-fsigned-zeros' '-ftrapping-math' '-fno-trapv' '-fno-openmp' '-fno-openacc' '-mtune=generic' '-march=x86-64' '-O2' '-flto' '-fstack-protector-strong'"
.text
.comm   __gnu_lto_v1,1,1
.comm   __gnu_lto_slim,1,1
.ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609"
.section    .note.GNU-stack,"",@progbits

So it still does not appear that gcc is doing any optimization: it has not removed these duplicate instructions, which have no functional advantage and are basically dead code. It does show that gcc will pre-process the code if the file has the .S extension but not if it has .s (you can experiment or read the docs for other extensions such as .asm). These were run on Linux; gcc is gcc and binutils is binutils, but the sensitivity to specific file-name extensions may vary by target host.

The link-time optimization appears to be related to the high-level code, as one would expect, not the assembly language code. One expects link-time optimization to be based on the middle-end code, not the back end.

We know that gcc is not an assembler: it just passes assembly on to the assembler, even when that assembly was generated from C. To optimize assembly it would need an assembly parser, and then logic to deal with that language and pick out things to pass on for link-time optimization.

You can read more on link-time optimization and see if there is a way to apply it to the assembler. I would assume not, but your entire question is about how to use the tools and how they work.

Assembly language optimization isn't necessarily a thing; that is kind of the point. There are, however, pseudo-instructions for which the assembler may choose an optimized implementation:

ldr r0,=0x12345678
ldr r0,=0x1000
ldr r0,=0xFFFFFF12

00000000 <.text>:
   0:   e59f0004    ldr r0, [pc, #4]    ; c <.text+0xc>
   4:   e3a00a01    mov r0, #4096   ; 0x1000
   8:   e3e000ed    mvn r0, #237    ; 0xed
   c:   12345678    .word   0x12345678

But those are pseudo-instructions, so an assembler that supports them is free to do whatever it wants. (Assemblers define the assembly language, not the target, so by definition they get to do whatever they want.) On that note, using a compiler as an assembler when the toolchain also has an assembler changes it into yet another assembly language, as assembly language is defined by the tool. So when you allow gcc to pre-process the code you are basically using a different assembly language from as. Inline assembly for the compiler is likewise yet another assembly language. That makes at least three assembly languages per target for the GNU toolchain.
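The choice the assembler made for the three ldr r0,=const lines above can be reproduced with a short model. This is a sketch under stated assumptions, not the GNU assembler's code: the function names are hypothetical, the destination is fixed to r0, only classic ARM (A32) mov/mvn immediates are considered (an 8-bit value rotated right by an even amount), and newer options such as movw are ignored.

```python
def arm_modified_immediate(value):
    """Return (imm8, rot) if value is imm8 rotated right by 2*rot, else None."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        shift = 2 * rot
        # Rotating left by `shift` undoes the CPU's rotate-right by `shift`.
        candidate = ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF
        if candidate <= 0xFF:
            return candidate, rot
    return None


def expand_ldr_pseudo_r0(value):
    """Model how `ldr r0,=value` may be expanded: mov, mvn, or a literal-pool load."""
    value &= 0xFFFFFFFF
    encoded = arm_modified_immediate(value)
    if encoded is not None:
        imm8, rot = encoded
        return "mov", 0xE3A00000 | (rot << 8) | imm8
    encoded = arm_modified_immediate(~value)
    if encoded is not None:
        imm8, rot = encoded
        return "mvn", 0xE3E00000 | (rot << 8) | imm8
    # Neither form fits: the constant goes in a literal pool and is loaded
    # pc-relative, so the instruction word depends on the pool's position.
    return "ldr", None


for constant in (0x12345678, 0x1000, 0xFFFFFF12):
    mnemonic, word = expand_ldr_pseudo_r0(constant)
    print(hex(constant), mnemonic, None if word is None else hex(word))
```

For the three constants in the example the model picks a literal-pool ldr, mov (0xe3a00a01) and mvn (0xe3e000ed), in line with the disassembly above.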
