gcc: strange asm generated for simple loop

Question

m68k-linux-gnu-gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CFLAGS = -Wall -Werror -ffreestanding -nostdlib -O2 -m68000 -mshort

I am very confused why gcc generates such (seemingly) non-optimal code for a simple for loop over a const array.

const unsigned int pallet[16] = {
  0x0000,
  0x00E0,
  0x000E,
  ...
  0x0000
};

...

volatile unsigned long * const VDP_DATA = (unsigned long *) 0x00C00000;

...

for(int i = 0; i < 16; i++) {
  *VDP_DATA = pallet[i];
}

Results in:

 296:   41f9 0000 037e  lea 37e <pallet+0x2>,%a0
 29c:   223c 0000 039c  movel #924,%d1
 2a2:   4240            clrw %d0
 2a4:   0280 0000 ffff  andil #65535,%d0
 2aa:   23c0 00c0 0000  movel %d0,c00000 <_etext+0xbffc2c>
 2b0:   b288            cmpl %a0,%d1
 2b2:   6712            beqs 2c6 <main+0x46>
 2b4:   3018            movew %a0@+,%d0
 2b6:   0280 0000 ffff  andil #65535,%d0
 2bc:   23c0 00c0 0000  movel %d0,c00000 <_etext+0xbffc2c>
 2c2:   b288            cmpl %a0,%d1
 2c4:   66ee            bnes 2b4 <main+0x34>

My main concern:

Why the useless first element compare at 2b0? This will never hit and never gets branched back to. It just ends up being duplicate code all for the first iteration.

Is there a better way to write this dead-simple loop such that gcc wont produce this strange code?
Are there any compiler flags/optimizations I can take advantage of? O3 simply unrolls the loop, which I don't want either as space is a bigger concern than speed at this part of the code.
Maybe I'm being too scrupulous, but I just figured this wouldn't be the most difficult code to generate. I was expecting something more along the lines of (probably wrong but you get the idea):

lea pallet,%a0
movel #7,%d0
1:
movel %a0@+,c00000
dbra %d0,1

I get that I have to be a bit more explicit in my code to get it to write in long chunks. My main point here is how come gcc can't seem to figure out the my intentions i.e I just want to dump this data in to this address.

Another observation:

clrw %d0 → andil #65535,%d0 → movel %d0,c00000. Why not just clrl and move?

Is `-Os` useful? That optimizes for size (but not at the total expense of speed). Or maybe `-O3 -fno-unroll-loops`? 68k is not widely used so I'm not too surprised there are some facepalm level missed peephole and other optimizations that nobody's noticed until now, unless there's some microarchitectural reason for them with tune=generic caring about some CPU. Did you try with `-march=68020` or whatever? Anyway, I'd guess that GCC devs would appreciate bug reports for missed optimizations (https://gcc.gnu.org/bugzilla/), especially simple peepholes with small reproducible examples. — Peter Cordes, Mar 16 '20 at 03:07
(One missed optimization per bug report, even if they're all from compiling the same code.) — Peter Cordes, Mar 16 '20 at 03:07
@PeterCordes `-Os` seems to produce more reasonable code, so thanks for that! I wasn't aware of that and can surely function attribute that where necessary. Strangely `-O3` ignores `-fno-unroll-loops`... — grep, Mar 16 '20 at 03:19
@PeterCordes However, it appears `optimize("s"), always_inline` doesn't work together. Inlines it but ignores the optimization. Same with `static inline` plus just the optimize attribute.. — grep, Mar 16 '20 at 03:38
Fully unrolling for small constant numbers of iterations might be controlled by a different GCC option. Perhaps `-fno-peel-loops`? Yeah, [the documentation](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fpeel-loops) for that looks like it would apply to your 0..15 loop, and it's only enabled at `-O3`, not `-O2`. GCC's tuning heuristics can even set thresholds for when to do this: [How to ask GCC to completely unroll this loop (i.e., peel this loop)?](https://stackoverflow.com/q/36110591) — Peter Cordes, Mar 16 '20 at 03:45
Can m68k do memory to memory `movel`? Also, that can't work for your case because you need to zero-extend from 16-bit `unsigned int` to 32-bit `unsigned long`. — Peter Cordes, Mar 16 '20 at 03:49
@PeterCordes `-fno-peel-loops` certainly does the trick! I'm not sure I understand how peeling the loop in this case was ever an optimization? It seemingly results in unused code. I'm not sure about the memory to memory `movel` - this is my first time using both C and any sort of ASM. The opcode docs for `movel` are not obvious to me. — grep, Mar 16 '20 at 03:55
Didn't fully unrolling (aka peeling) result in fewer total instructions *executed*? Just construct each constant value in a register and `movel %d0,c00000`, with no cmp / branch overhead. If this was inside a hot loop that fit in the instruction cache, it could be a win. re: m68k move memory to memory: yes, unlike most ISAs m68k can copy memory to memory with one `move` instruction. http://mrjester.hapisan.com/04_MC68/Sect01Part05/Index.html. The only roadblock is needing 16-bit loads / 32-bit stores; you could try with an `unsigned long` palette array to see what gcc does. — Peter Cordes, Mar 16 '20 at 04:06
@PeterCordes Ah, I thought peeling referred to what it was doing above i.e doing one or more iterations first before looping. I really appreciate all the information - Ive learned a lot. Another thing I just did was union it with an unsigned long array half the size and it significantly cleaned up the code generation. If you'd like to compile these suggestions in to an answer I'd love to accept it. — grep, Mar 16 '20 at 04:14
Ah yes, I would call that peeling the *first* iteration where the value to be stored is a simple `0`, as opposed to fully unrolling the loop. I'm not sure if there's a separate option for that. re: *an unsigned long array half the size* huh, but that would store different things! You'd be storing 2 packed 16-bit integers per store to `*VDP_DATA`, rather than one zero-extended 16-bit integer. Or did you mean something different that just happened to hand-hold GCC into making a cleaner zero-extension loop, maybe by stopping it from seeing through the compile-time-constness?/ — Peter Cordes, Mar 16 '20 at 04:20
Oh, and BTW, m68000 doesn't have any I-cache so small code footprint is only valuable for fitting code into RAM limits, and `-O3` doesn't care about that. 68010 has a loop buffer for tiny 2 isns loops (https://mikro.naprvyraz.sk/docs/Optimize/68OPT.TXT and wikipedia), so optimizing for that can be a win, but you used `gcc -m68000`. 68020 has a 256B instruction cache, so again small code size can let a large hot loop run from cache. For your case, consider `-fprofile-generate` / `-fprofile-use` profile-guided optimization might let gcc be smarter about that when it knows if your code is hot. — Peter Cordes, Mar 16 '20 at 04:27
@PeterCordes Per the union, I just found out the port I'm writing to accepts the data as long in this scenario. Per the code footprint and architecture - I'm working on a Sega Genesis game so the size consideration is because of the rom size limit. I'll look at those profiling flags! — grep, Mar 16 '20 at 04:35
Ok yes, then a union is probably good. The other option would be 4-byte `memcpy(VDP_DATA, ptr, 4)` which the compiler can inline to a single `move.l` — Peter Cordes, Mar 16 '20 at 04:40
Could you elaborate on how `VDP_DATA` behaves on your hardware? Does it make a difference if you move a long at once, or clear the first two bytes and then move the lower word to `0x00c00002`? If the latter works, do you need to clear the upper word every time? — chtz, Mar 18 '20 at 13:09

score 1 · Answer 1 · answered Mar 17 '20 at 07:26

I've been playing with GCC and 68k code generation and I've found that it merely can't generate decent code for 68k family any more, particularly not for 68000.

The code is barely correct, however not optimized (or should I say, it seems to be DE-optimized?). You should first try to use -Os instead of -O2. Even then you'll encounter lots of useless insns in the generated code.

My speculation is that while the actual architectures support in GCC quickly moves forward, backend for 68k is not properly maintained, being simply kept correct with minimal effort.

gcc: strange asm generated for simple loop

1 Answers1

Linked