I have a piece of C code that has an int array - the code makes several reads to the array. When I compile it with GCC to X86 assembly using the -O0 flag, in the assembly all the read accesses to the array are made using the movl instruction - a 32 bit load. This makes sense because ints are 32 bits and so accesses to arrays of them should use 32 bit loads.

However, when I compile it using the -O3 flag, several of the 32 bit movl reads to the array are replaced with 64 bit loads into the XMM registers instead... I assume this is some sort of optimization, but the optimized disassembly is very challenging to decipher and I'm a bit lost about what's going on.

Without going into too much detail about my work, I need to use the O3 flag, but I need all accesses to my 32 bit int array to use 32 bit accesses.

Does anyone have any insight into what could possibly be going on, and how I can force all loads from my array to be 32 bits while still using the -O3 flag?

Example to reproduce:

Here's the C code:

#include <stdlib.h>

int main() {
  int* arr = malloc(sizeof(int) * 64);
  int sum = 0;

  for (int i = 0; i < 10; i++) {
    sum += arr [i] + arr[i+1];
  }

  if (sum == 0)
    return 0;
  else
    return 1;
}

For the unoptimized disassembly, compile with (note the 32 bit loads in the disassembly):

gcc -S -fverbose-asm -o mb64BitLoadsNoOpt.s mb64BitLoads.c

For the optimized disassembly, compile with (note the XMM register 64 bit loads in the disassembly):

gcc -O3 -S -fverbose-asm -o mb64BitLoadsOpt mb64BitLoads.c

Farhad
  • Include your C or C++ code, and the disassembly of that code, in your question. Use the appropriate language tag - not both. – 1201ProgramAlarm Nov 01 '20 at 19:46
  • Try to use `gcc -Wall -Wextra -O3 -S -fverbose-asm yourcode.c` then examine `yourcode.s`. Notice that C and C++ are different programming languages. See [this reference website](https://en.cppreference.com/w/) and provide some [mre] in your question – Basile Starynkevitch Nov 01 '20 at 19:46
  • `I need to use the O3 flag, but I need all accesses to my 32 bit int array to use 32 bit accesses.` why do you need that? Does the target hardware not support the instructions generated with O3? – t.niese Nov 01 '20 at 19:49
  • @BasileStarynkevitch I'm already using that but it's still challenging to understand. I've included a reproducible example. – Farhad Nov 01 '20 at 19:57
  • @t.niese It's complicated, I'm essentially doing microarchitecture research and my micro-architectural simulator needs the workload to have a specific form. – Farhad Nov 01 '20 at 19:57
  • @Phidias it is hard to tell what you want to achieve. But you could compile it with e.g. `-mno-sse2` or use `-march` – t.niese Nov 01 '20 at 20:03
  • You can use `volatile`. (e.g.) `volatile int* arr = malloc(sizeof(int) * 64);` – Craig Estey Nov 01 '20 at 20:04
  • @t.niese: Or `-fno-tree-vectorize`, which wouldn't stop it from using SSE2 for memcpy, and wouldn't break floating-point code (where SSE2 for XMM registers is part of the ABI / calling convention) – Peter Cordes Nov 02 '20 at 00:06
  • Note that XMM registers are 128 bits wide. When GCC decides to auto-vectorize, it will use `movdqu` 128-bit loads for this, not just 64-bit `movq` except for the last vector. https://godbolt.org/z/8xYdT3. Heh, gcc misses the major algorithmic optimization possible: each element is added twice, except for arr[0] and arr[10]. So you just need to vectorize the sum of the middle elements 1..9, double it, and add the two end elements. More efficient than gcc's unaligned loads. – Peter Cordes Nov 02 '20 at 07:15
  • @PeterCordes Thanks, that solved my problem. If you post your comment as an answer I'll be happy to accept it. – Farhad Nov 02 '20 at 18:56
  • I don't need to post an answer; the answer has already been written. That's why I linked this question as a duplicate, see the top of the page. – Peter Cordes Nov 02 '20 at 21:25

1 Answer

Without going into too much detail about my work, I need to use the -O3 flag, but I need all accesses to my 32 bit int array to use 32 bit accesses.

This is contradictory. See this reference website, the C11 draft standard n1570, and the C++11 draft standard n3337.

These standards do not require all accesses to your 32-bit int array to be 32-bit accesses.

With a recent GCC 10, use gcc -fverbose-asm -O3 -S foo.c then look into foo.s to understand why your compiler is optimizing.

You could try other compilers, such as CompCert or Clang (or even tinycc or nwcc, or write your own, starting from Frama-C or even from scratch ...). Check with your boss whether their licenses are suitable for your work.

You could also adapt these compilers, or write your own GCC plugin, to suit your needs.

If you insist on using GCC, you could consider using various assembler-related extensions. Of course, read about invoking GCC.

GCC is free software: you are allowed to study its source code and then improve it.

Download the source code of GCC, then compile and install it. Look in particular into its pass manager. You will certainly be able to remove the optimizations that are annoying you, and you could probably do that with your own GCC plugin. Subscribe to the gcc-help@gcc.gnu.org mailing list and ask for help there.

Basile Starynkevitch
  • Sorry these documents are very long - what should I be looking at in them? And why is it contradictory - can I not turn off certain specific optimizations that -O3 does? – Farhad Nov 01 '20 at 19:52
  • Because the C standard specifies that such optimizations are legal – Basile Starynkevitch Nov 01 '20 at 20:45
  • You certainly can turn off certain optimizations. You just need to download the source code of GCC and dive inside it. – Basile Starynkevitch Nov 01 '20 at 21:16
  • You can download GCC source code from the [listed mirrors](https://gcc.gnu.org/mirrors.html), for example http://mirrors.kernel.org/gnu/gcc/gcc-10.2.0/gcc-10.2.0.tar.xz – Basile Starynkevitch Nov 02 '20 at 06:39
  • @Phidias: To be fair, there are other optimizations that could happen in other code, e.g. `arr[0] = arr[1] = 0;` could be done with one integer `mov qword ptr [rdi], 0` store. Auto-vectorization is a different optimization from store coalescing. If you just want easy-to-read asm for most normal cases, not any strict correctness requirement for what the compiler might possibly do, then this answer is total overkill. But given the way you posed the question it's not really wrong, just not very helpful. See also [How to remove "noise" from GCC asm output?](//stackoverflow.com/q/38552116) – Peter Cordes Nov 02 '20 at 07:22