
I have stumbled upon the following problem. The code snippet below does not link on Mac OS X with any Xcode version I tried (4.4, 4.5):

#include <stdlib.h>
#include <string.h>
#include <emmintrin.h>

int main(int argc, char *argv[])
{
  char *temp;
#pragma omp parallel
  {
    __m128d v_a, v_ar;
    memcpy(temp, argv[0], 10);
    v_ar = _mm_shuffle_pd(v_a, v_a, _MM_SHUFFLE2 (0,1));
  }
}

The code is provided only as an example and would segfault if you ran it. The point is that it does not build: the link step fails. I build it with the following command line, which produces the errors shown after it:

/Applications/Xcode.app/Contents/Developer/usr/bin/gcc test.c -arch x86_64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.7.sdk -mmacosx-version-min=10.7 -fopenmp

 Undefined symbols for architecture x86_64:
"___builtin_ia32_shufpd", referenced from:
    _main.omp_fn.0 in ccJM7RAw.o
"___builtin_object_size", referenced from:
    _main.omp_fn.0 in ccJM7RAw.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status

The code builds just fine without the -fopenmp flag. I googled around and found a solution for the first problem, the one connected with memcpy: add -fno-builtin or -D_FORTIFY_SOURCE=0 to the gcc argument list. I did not manage to solve the second problem (the SSE intrinsic).

Can anyone help me to solve this? The questions:

  • most importantly: how to get rid of the "___builtin_ia32_shufpd" error?
  • what exactly is the reason for the memcpy problem, and what does the -D_FORTIFY_SOURCE=0 flag eventually do?
jww
angainor
  • compiles fine for me (OSX10.8.2, Xcode 4.5, macports gcc 4.7.1) when using -fopenmp -O1 (or higher, but not -O0: this gives linker error with missing `___gxx_personality_v0`). However code produces segfault when run. When compiling without -fopenmp, code compiles for any -O, but again segfaults (except for -O0: bus error). – Walter Oct 17 '12 at 10:34
  • @Walter Thanks. segfault is not a problem, the code is just an example - of course it is wrong. You are using gcc 4.7.1, so not Xcode compilers, right? Could you compile with the commandline I gave? Changing of the optimization level did not help here.. – angainor Oct 17 '12 at 10:37
  • This is a bug in the `llvm-gcc` compiler that Xcode ships with. It is an LLVM compiler with GCC frontend. The OpenMP phase is generating some builtins that the backend is not able to recognise. As Xcode is steadily moving towards fully replacing GCC with `clang`, the bug will probably never get fixed. Just install the real GCC from source or via some other method and use it to compile OpenMP codes. – Hristo Iliev Oct 18 '12 at 11:31
  • @HristoIliev That might well be what I will do. But the question is then why the same builtins work with no OpenMP? Seems like `gcc` is only used for OpenMP code. Also, I do not use the builtins directly in my code. I use `memcpy` and `_mm_shuffle_pd`. Those should be supported by `clang` i guess. Anyway, thanks for your reply. – angainor Oct 18 '12 at 11:38
  • Clang does not yet have OpenMP support. `_mm_shuffle_pd` is implemented as an inline function that calls `__builtin_ia32_shufpd`. The other builtin comes from `FORTIFY_SOURCE` being enabled on most Unixes. It replaces certain unsafe functions with safer, checking variants, either implemented as inlines or resulting from tree transformations. – Hristo Iliev Oct 18 '12 at 12:04
  • @HristoIliev Thanks. Why don't you write it up as an answer? – angainor Oct 18 '12 at 12:07
  • I would check something on my Mac later today and will write you a more complete answer. – Hristo Iliev Oct 18 '12 at 12:11
  • @Walter Thanks for your suggestion, I have installed macports and got the code to work. – angainor Oct 18 '12 at 19:25

1 Answer


This is a bug in the way Apple's LLVM-backed GCC (llvm-gcc) transforms OpenMP regions and handles calls to the built-ins inside them. The problem can be diagnosed by examining the intermediate tree dumps (obtainable by passing the -fdump-tree-all argument to gcc). Without OpenMP enabled, the following final code representation is generated (from test.c.016t.fap):

main (argc, argv)
{
  D.6544 = __builtin_object_size (temp, 0);
  D.6545 = __builtin_object_size (temp, 0);
  D.6547 = __builtin___memcpy_chk (temp, D.6546, 10, D.6545);
  D.6550 = __builtin_ia32_shufpd (v_a, v_a, 1);
}

This is a C-like representation of how the compiler sees the code internally after all transformations. This is what then gets turned into assembly instructions. (Only the lines that refer to the built-ins are shown here.)

With OpenMP enabled, the parallel region is extracted into its own function, main.omp_fn.0:

main.omp_fn.0 (.omp_data_i)
{
  void * (*<T4f6>) (void *, const <unnamed type> *, long unsigned int, long unsigned int) __builtin___memcpy_chk.21;
  long unsigned int (*<T4f5>) (const <unnamed type> *, int) __builtin_object_size.20;
  vector double (*<T6b5>) (vector double, vector double, int) __builtin_ia32_shufpd.23;
  long unsigned int (*<T4f5>) (const <unnamed type> *, int) __builtin_object_size.19;

  __builtin_object_size.19 = __builtin_object_size;
  D.6587 = __builtin_object_size.19 (D.6603, 0);
  __builtin_ia32_shufpd.23 = __builtin_ia32_shufpd;
  D.6593 = __builtin_ia32_shufpd.23 (v_a, v_a, 1);
  __builtin_object_size.20 = __builtin_object_size;
  D.6588 = __builtin_object_size.20 (D.6605, 0);
  __builtin___memcpy_chk.21 = __builtin___memcpy_chk;
  D.6590 = __builtin___memcpy_chk.21 (D.6609, D.6589, 10, D.6588);
}

Again, I have only left the code that refers to the built-ins. What is apparent (although the reason for it is not immediately apparent to me) is that the OpenMP code transformer really insists on calling all the built-ins through function pointers. These pointer assignments:

__builtin_object_size.19 = __builtin_object_size;
__builtin_ia32_shufpd.23 = __builtin_ia32_shufpd;
__builtin_object_size.20 = __builtin_object_size;
__builtin___memcpy_chk.21 = __builtin___memcpy_chk;

generate external references to symbols which are not really symbols but rather names that get special treatment by the compiler. The linker then tries to resolve them but is unable to find any of the __builtin_* names in any of the object files that the code is linked against. This is also observable in the assembly code that one can obtain by passing -S to gcc:

LBB2_1:
    movapd  -48(%rbp), %xmm0
    movl    $1, %eax
    movaps  %xmm0, -80(%rbp)
    movaps  -80(%rbp), %xmm1
    movl    %eax, %edi
    callq   ___builtin_ia32_shufpd
    movapd  %xmm0, -32(%rbp)

This basically is a function call that takes 3 arguments: one integer in %edi (copied there from %eax) and two XMM arguments in %xmm0 and %xmm1, with the result being returned in %xmm0 (as per the SysV AMD64 ABI function calling convention). In contrast, the code generated without -fopenmp is an instruction-level expansion of the intrinsic, as it is supposed to be:

LBB1_3:
    movapd  -64(%rbp), %xmm0
    shufpd  $1, %xmm0, %xmm0
    movapd  %xmm0, -80(%rbp)

What happens when you pass -D_FORTIFY_SOURCE=0 is that memcpy is not replaced by the "fortified" checking version and a regular call to memcpy is used instead. This eliminates the references to object_size and __memcpy_chk but cannot remove the call to the ia32_shufpd built-in.

This is obviously a compiler bug. If you really really really must use Apple's GCC to compile the code, then an interim solution would be to move the offending code to an external function as the bug apparently only affects code that gets extracted from parallel regions:

#include <string.h>
#include <emmintrin.h>

void func(char *temp, char *argv0)
{
   __m128d v_a, v_ar;
   memcpy(temp, argv0, 10);
   v_ar = _mm_shuffle_pd(v_a, v_a, _MM_SHUFFLE2 (0,1));
}

int main(int argc, char *argv[])
{
  char *temp;
#pragma omp parallel
  {
    func(temp, argv[0]);
  }
}

The overhead of one additional function call is negligible compared to the overhead of entering and exiting the parallel region. You can use OpenMP pragmas inside func; they will work because of the dynamic scoping of the parallel region.

Maybe Apple will provide a fixed compiler in the future, maybe they won't, given their commitment to replacing GCC with Clang.

Hristo Iliev
  • Thank you for thoroughly explaining to me that this won't work ;) I gave up on `Xcode` and installed macports gcc 4.7, as previously mentioned by Walter. It worked with no problems. The only annoying thing is that in order to use macports you anyway need to install Xcode **and** command line tools, and you get a new set of compilers on top. just wow. – angainor Oct 18 '12 at 19:24