
Question

I am testing a simple program which calculates the Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks whether a point belongs to the Mandelbrot set or not. The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I read, the overhead is usually negligible and the highest overhead I came across was about 6%. I measured around 30% overhead. Any advice will be appreciated!

Details of my project

I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64. The results (times in seconds) look as follows:

    #iter     C (fPIC)  C       C/C(fPIC)
    1         0.01      0.01    1.00 
    100       0.04      0.03    0.75 
    200       0.06      0.04    0.67 
    500       0.15      0.1     0.67 
    1000      0.28      0.19    0.68
    2000      0.56      0.37    0.66 
    4000      1.11      0.72    0.65 
    8000      2.21      1.47    0.67
   16000      4.42      2.88    0.65 
   32000      8.8       5.77    0.66 
   64000      17.6      11.53   0.66

Commands I use:

    gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
    gcc -O3 fractalMain.c fractal.c -o f

Code: fractalMain.c

#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"

int main()
{
    int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
    int it;
    for(it = 0; it < 11; ++it)
    {
        clock_t start = clock();
        fractal(iterNumber[it]);
        clock_t end = clock();
        double seconds = (double)(end - start) / CLOCKS_PER_SEC;
        printf("Iter: %d, time: %lf \n", iterNumber[it], seconds);
    }
    return 0;
}

Code: fractal.h

#ifndef FRACTAL_H
#define FRACTAL_H
    void fractal(int iter);
#endif

Code: fractal.c

#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"

void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
    *res_re = a_re*b_re - a_im*b_im;
    *res_im = a_re*b_im + a_im*b_re;
}

void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
    multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
} 

bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
    double zPrev_re = P_re;
    double zPrev_im = P_im;
    double zNext_re = 0;
    double zNext_im = 0;
    double* p_zNext_re = &zNext_re;
    double* p_zNext_im = &zNext_im;
    int i;  
    for(i = 1; i <= iter; ++i)
    {
        sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
        zNext_re = zNext_re + C_re;
        zNext_im = zNext_im + C_im;
        if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
        {
            return false;
        }
        zPrev_re = zNext_re;
        zPrev_im = zNext_im;
    }
    return true;
}

bool isMandelbrot(double P_re, double P_im, int iter)
{
    return isInSet(0, 0, P_re, P_im, iter);
}
void fractal(int iter)
{
    int noIterations = iter;
    double xMin = -1.8;
    double xMax = 1.6;
    double yMin = -1.3;
    double yMax = 0.8;
    int xDim = 512;
    int yDim = 384;
    double P_re, P_im;
    int nop = 0;
    int x, y;

    for(x = 0; x < xDim; ++x)
        for(y = 0; y < yDim; ++y)
        {
            P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
            P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
            if(isMandelbrot(P_re, P_im, noIterations))
                nop = x+y;
        }
    printf("%d", nop);
}

Story behind the comparison

It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments), so a few words of explanation. First I compiled the program only as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from Lua - and saw big time differences, but couldn't understand why they grew with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only without needing the .so), the times are very similar to C without -fPIC. I have checked this in a few configurations over the last few days and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.

Shmuel Levine
Ola M
    Cannot reproduce using gcc 4.7.2 on x86_64 (OS/X). – NPE Apr 07 '13 at 11:21
  • You forgot to give the `fractal.h` header – Basile Starynkevitch Apr 07 '13 at 11:25
  • 1
    Do you have anything set in `CFLAGS`? – teppic Apr 07 '13 at 11:26
  • @NPE - meaning you're getting similar times regardless of the flag? – Ola M Apr 07 '13 at 11:26
  • @OlaM: Yes, almost identical timings (e.g. 10.985 vs 10.976). – NPE Apr 07 '13 at 11:27
  • I'm a bit surprised by your results. You're not really testing the `-fPIC` overhead that way. Since you're building an executable, gcc can make some assumptions about the location of the final function call. You need to build part of your application as a shared library, and the overhead will be in the calls that you need to make from your main application into the shared library. Though, in your current split, this also wouldn't make a discernible difference. You'd have to put either `multiplyComplex` or `sqComplex` into the shared library, to really start seeing an impact. – John Szakmeister Apr 07 '13 at 11:30
  • 1
    FWIW, I too get similar timings between the runs, just like @OlaM. – John Szakmeister Apr 07 '13 at 11:31
  • @Basile Starynkevitch: added the header to the question. – Ola M Apr 07 '13 at 11:38
  • @teppic: no, checked 'echo $CFLAGS' - it's empty. – Ola M Apr 07 '13 at 11:39
  • 2
    I get: gcc 12.78/18.23 -- clang 13.73 / 13.75 – teppic Apr 07 '13 at 11:41
  • Interesting. I see the difference under Linux too. Making everything in `fractal.c` static except `fractal()` makes the timings the same between implementations. I'm not sure why `gcc` isn't optimizing this better under Linux. – John Szakmeister Apr 07 '13 at 11:49
  • @jszakmeister - I added the explanation of where the problem started to the question - have checked it also building a proper .so. – Ola M Apr 07 '13 at 11:53
  • 4
    Take a look at the `objdump -d` output for each executable. You can see that some optimizations are being excluded in the `-fPIC` version inside the `fractal()` function. What exactly are you trying to measure the performance of here? The overhead of calling `fractal()` when it's compiled with `-fPIC`? Or are you doing the fractal work in Lua and calling out to the lower level routines? One reason why the non-fPIC version is so much better is because a lot of work is being inlined in `fractal()`. In the `-fPIC` version, it's not. Again, you can fix that by making the helpers static. – John Szakmeister Apr 07 '13 at 12:06
  • @jszakmeister: many thanks for the comment. What I am looking for is a fair performance comparison between pure C and my Lua+C programs. And I was wondering whether the overhead was related to my imperfect implementation - which it looks like it is: you're right that the helper functions should be static, and that prevents the overhead. Also, from the other comments it looks like clang is better at optimizing the imperfect code itself. – Ola M Apr 07 '13 at 14:30
  • 2
    Stupid question, but you are quite sure that you are compiling as 64-bit code, right? The IA-32 instruction set was not designed for position-independent code and it would be normal to see this kind of difference if you were mistakenly using it. x86_64 **was** designed to make position-independent code as fast as position-dependent code, and the normal situation is not to have any measurable difference (as NPE and others found) – Pascal Cuoq Apr 07 '13 at 14:53
  • @Pascal Cuoq: yes, Target: x86_64-linux-gnu. And yes, I read too that the difference should be negligible in particular for 64-bit... – Ola M Apr 07 '13 at 15:07
  • 1
    You may also pass `-flto` both at compile and at link time (in addition of `-O3`), and you could also give `-mtune=native` – Basile Starynkevitch Apr 07 '13 at 15:21
  • With gcc-4.8 and GLibc 2.17 on a Linux kernel 3.8.5 (x86-64 i3770K, Debian/Sid/AMD64 with a bit of experimental) I'm getting 27.55s for PIC and 18.24s for nonPIC; I've got no idea why such a big difference... – Basile Starynkevitch Apr 07 '13 at 15:29
  • 3
    However, if compiling with `gcc-4.8 -flto -O3 -mtune=native` with or without `-fPIC` I am getting only 18.28sec for non PIC and 18.36 for PIC. So I advise to use `-flto -mtune=native -O3` with or without `-fPIC` – Basile Starynkevitch Apr 07 '13 at 15:33
  • @NPE: [**OS X requires position-independent code in executables, not just libraries**](https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachOTopics/1-Articles/x86_64_code.html). So gcc is forced to enable `-fPIE` for executables, not just libraries. (On x86-64, I don't think there's much if any difference between `-fPIC` and `-fPIE`, but `-fPIE` might be able to take advantage of things that library code couldn't). Linux and Windows don't have this requirement, so it makes sense that if there is a difference, it's not reproducible on OS X. – Peter Cordes Mar 31 '16 at 05:01

3 Answers

54

It turns out that when you compile without the -fPIC option, multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static, you will likely get the same performance when compiling with -fPIC, because the compiler will be free to perform inlining.

The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.

The very same assumption can be made if you compile without -fPIC, i.e. the compiler can safely assume that a global symbol defined in the executable cannot be interposed because the lookup scope begins with the executable itself which is then followed by all other libraries, including the preloaded ones.

For a more thorough understanding, have a look at the following paper.

redblackbit
15

As other people have already pointed out, -fPIC forces GCC to disable many optimizations, e.g. inlining and cloning. I'd like to point out several ways to overcome this:

  • replace -fPIC with -fPIE if you are compiling the main executable (not libraries), as this allows the compiler to assume that interposition is not possible;
  • use -fvisibility=hidden and __attribute__((visibility("default"))) to export only the necessary functions from the library and hide the rest; this allows GCC to optimize hidden functions more aggressively;
  • use private symbol aliases (__attribute__((alias("__f")))) to refer to library functions from within the library; this would again untie GCC's hands;
  • the previous suggestion can be automated with the -fno-semantic-interposition flag that was added in recent GCC versions.

It's interesting to note that Clang is different from GCC as it allows all optimizations by default regardless of -fPIC (can be overridden with -fsemantic-interposition to obtain GCC-like behavior).

yugr
  • By "libraries" do you mean Shared Libraries, or All Libraries (including static libraries)? I thought fpic and fPIC were for shared libraries, and actually slowed things down without decreasing memory use in static libraries? – dstromberg Oct 07 '22 at 15:53
  • @dstromberg `-fPIC` is orthogonal to the type of the library. One could choose to compile a static library with `-fPIC` if one wants it to be linkable both into executables and into shared libs. – yugr Oct 08 '22 at 10:37
3

As others have discussed in the comment section of your opening post, compiling with -flto should help to reduce the difference in run-times you are seeing for this particular case, since the link time optimisations of gcc will likely figure out that it's actually ok to inline a couple of functions ;)

In general, link-time optimisations could lead to massive reductions in code size (~6%) link to paper on link time optimisations in gold, and thus run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android. This question on SO briefly discusses it as well. Also, just to let you know, -fpic is the faster version of -fPIC, so if you must use -fPIC try -fpic instead - link to gcc docs. For x86 it might not make a difference, but you need to check this for yourself/ask on gcc-help.

baibo
  • Tighter security because it enables address-space layout randomization. I don't think PIC without ASLR gains you anything security-wise. I assume this is why [OS X requires position-independent code in executables, not just libraries](https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachOTopics/1-Articles/x86_64_code.html). (BTW, this explains the first comment on the question saying it's not reproducible on OS X: either `-fPIC` doesn't stop inlining, or it can't happen in either case, because OS X does support symbol interposition.) TL;DR: `static` is good! – Peter Cordes Mar 31 '16 at 05:09
  • Agner Fog's [proposal for an extensible instruction set](http://www.agner.org/optimize/blog/read.php?i=421#abi) doesn't support symbol interposition, which removes the overhead of indirection through the GOT even for PIC. In x86-64, most of the overhead of PIC is the indirection, not the relative addressing. RIP-relative addressing is only slightly more expensive than using absolute addressing, and is only needed for accessing global data. Function calls are relative anyway so there's zero overhead. (even in 32bit, the `call` instruction uses a 32bit relative displacement.) – Peter Cordes Mar 31 '16 at 05:18
  • I don't know much about this proposal or the specifics of PIC code, but I also suggested that it "might" not be slower for x86. Also, my comment about gcc people not caring about PIC performance still stands - which libraries are you referring to? – baibo Mar 31 '16 at 16:34
  • `zlib` (gzip) is one of the best examples of high performance software compiled as a library: computationally expensive but not using asm. There's also OpenSSL, BLAS / LAPACK libraries (like ATLAS), video encoders like libx264, OpenGL drivers, GUI libraries, `libc`, `libm`, `libstdc++`, etc. Not to mention that a lot of desktop software builds most of its code as libraries. e.g. Chromium builds the v8 javascript engine as a library: `/usr/lib/chromium-browser/libs/libv8.so`. `ldd` on many GUI programs will show huge amounts of libraries, some of them with important code for the program. – Peter Cordes Mar 31 '16 at 18:23
  • I don't want to be mean when I say this, it's a genuine question - have you talked to people in the gcc community about whether they care about the performance of PIC code? I've talked to one such person and he told me that it's generally a secondary concern when it comes to which optimisations they select to work on next. That being said, I did not realise all of these programs are compiled as libraries (even though I actually knew about v8 and blas/lapack, I just did not make the connection that they need to be performant AND use PIC to be usable as a library). – baibo Mar 31 '16 at 18:41
  • 1
    No, I haven't talked to gcc people about PIC myself. Are you sure you were interpreting what you heard correctly? Most gcc optimizations apply equally well to PIC and non-PIC, so it's not that performance of PIC code doesn't matter. It sounds more like he meant that optimizations that *only* help avoid the PIC overhead aren't as high a priority (probably because gcc already does a decent job). It's not like gcc just gives up on important optimizations because nobody bothered to enable them in PIC mode. BTW, database libraries are another good example of perf-critical library code. – Peter Cordes Mar 31 '16 at 19:04
  • 2
    @PeterCordes GCC's ability to optimize code is significantly hindered `-fPIC` (e.g. function inlining or cloning is basically disabled). This is done by design, to support symbol interposition. Ways to overcome this are to use private symbol aliases, `-fno-semantic-interposition` and hidden visibility. You can read [this infamous post](https://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/) for more details. – yugr Jul 31 '18 at 10:15
  • 3
    `-flto` still has to preserve the interposition semantics so I don't think it'll be able to do any non-fPIC-friendly optimizations. – yugr Jul 31 '18 at 10:22
  • @yugr: Two years ago when I wrote those comments I'm not sure I understood how much some codebases depend on tiny functions inlining across translation units with LTO, or failed to use `static` on small functions in the same translation unit. Also, even -fPIE has some cost (on x86), but it doesn't include symbol interposition. I wrote about that vs. PIC in [32-bit absolute addresses no longer allowed in x86-64 Linux?](https://stackoverflow.com/q/43367427) – Peter Cordes Jul 31 '18 at 17:09