
I'm testing C code on Linux with large arrays to measure thread performance. The application scales very well as threads are added, up to the maximum number of cores (8 for the Intel 4770), but only for the pure math part of my code.

If I add the printf loop for the resulting arrays, the run time jumps from a few seconds to several minutes, even when output is redirected to a file, whereas printing those arrays should add only a few seconds.

The code (gcc 7.5.0, Ubuntu 18.04):

Without the printf loop:

gcc -O3 -m64 exp_multi.c -pthread -lm

With the printf loop:

gcc -DPRINT_ARRAY -O3 -m64 exp_multi.c -pthread -lm

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 8
static double xv[MAXSIZE];
static double yv[MAXSIZE];

/*    gcc -O3 -m64 exp_multi.c -pthread -lm    */


void* run(void *received_Val){
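    /* Each thread fills its own contiguous slice of xv[] and yv[], repeated REIT times. */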
    int single_val = *((int *) received_Val);
    int r;
    int i;
    double p;

  for (r = 0; r < REIT; r++) {
    p = XXX + 0.00001*single_val*MAXSIZE/num_threads;

    for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i++) {

      xv[i]=p;
      yv[i]=exp(p);

    p += 0.00001;
    }

  }

    return 0;
}


int main(){
    int i;
      pthread_t tid[num_threads];


    for (i=0;i<num_threads;i++){
      int *arg = malloc(sizeof(*arg));
      if ( arg == NULL ) {
        fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
        exit(1);
      }

      *arg = i;
        pthread_create(&(tid[i]), NULL, run, arg);
    }

    for(i=0; i<num_threads; i++)
    {
        pthread_join(tid[i], NULL);
    }

#ifdef PRINT_ARRAY
    for (i=0;i<MAXSIZE;i++){
        printf("x=%.20lf, e^x=%.20lf\n",xv[i],yv[i]);
    }
#endif

    return 0;
}


The malloc for the int passed as the last argument of pthread_create follows the approach suggested in this post.

I tried, without success: clang, adding a free(tid) call, avoiding malloc, reversing the loops, using only one one-dimensional array, and a single-threaded version without pthreads.
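
As a rough way to confirm where the time goes, here is a minimal timing sketch (assuming POSIX clock_gettime with CLOCK_MONOTONIC; the helper name is illustrative):

/* Hedged sketch: time the compute phase and the print phase separately,
   writing the timings to stderr so they stay out of the redirected stdout. */
#include <time.h>

static double seconds_since(const struct timespec *t0)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0->tv_sec) + (t1.tv_nsec - t0->tv_nsec) * 1e-9;
}

/* In main():
     struct timespec t0;
     clock_gettime(CLOCK_MONOTONIC, &t0);
     ... create and join the threads ...
     fprintf(stderr, "compute: %.3f s\n", seconds_since(&t0));
     clock_gettime(CLOCK_MONOTONIC, &t0);
     ... printf loop ...
     fprintf(stderr, "print:   %.3f s\n", seconds_since(&t0));
*/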

EDIT 2: I think the exp function is processor-resource intensive, probably limited by the per-core cache or the SIMD resources implemented by the processor generation. The following sample code is based on licensed code posted on Stack Overflow.

This code runs fast with or without the printf loop. exp from math.h was improved a few years ago and can be around 40x faster, at least on the Intel 4770 (Haswell); this link is a well-known test code for the math library vs SSE2, and the speed of exp in the math library should now be close to that of the AVX2 algorithm optimized for float and 8-wide parallel calculation.

Test results: expf vs. the SSE2 algorithm (exp_ps), among others:

sinf .. ->            55.5 millions of vector evaluations/second ->  12 cycles/value
cosf .. ->            57.3 millions of vector evaluations/second ->  11 cycles/value
sincos (x87) .. ->    9.1 millions of vector evaluations/second ->   71 cycles/value
expf .. ->            61.4 millions of vector evaluations/second ->  11 cycles/value
logf .. ->            55.6 millions of vector evaluations/second ->  12 cycles/value
cephes_sinf .. ->     52.5 millions of vector evaluations/second ->  12 cycles/value
cephes_cosf .. ->     41.9 millions of vector evaluations/second ->  15 cycles/value
cephes_expf .. ->     18.3 millions of vector evaluations/second ->  35 cycles/value
cephes_logf .. ->     20.2 millions of vector evaluations/second ->  32 cycles/value
sin_ps .. ->          54.1 millions of vector evaluations/second ->  12 cycles/value
cos_ps .. ->          54.8 millions of vector evaluations/second ->  12 cycles/value
sincos_ps .. ->       54.6 millions of vector evaluations/second ->  12 cycles/value
exp_ps .. ->          42.6 millions of vector evaluations/second ->  15 cycles/value
log_ps .. ->          41.0 millions of vector evaluations/second ->  16 cycles/value
/* Performance test exp(x) algorithm

based on AVX implementation of Giovanni Garberoglio


 Copyright (C) 2020 Antonio R.


     AVX implementation of exp:
     Modified code from this source: https://github.com/reyoung/avx_mathfun
     Based on "sse_mathfun.h", by Julien Pommier
     http://gruntthepeon.free.fr/ssemath/
     Copyright (C) 2012 Giovanni Garberoglio
     Interdisciplinary Laboratory for Computational Science (LISC)
     Fondazione Bruno Kessler and University of Trento
     via Sommarive, 18
     I-38123 Trento (Italy)


    This software is provided 'as-is', without any express or implied
    warranty.  In no event will the authors be held liable for any damages
    arising from the use of this software.
    Permission is granted to anyone to use this software for any purpose,
    including commercial applications, and to alter it and redistribute it
    freely, subject to the following restrictions:
    1. The origin of this software must not be misrepresented; you must not
       claim that you wrote the original software. If you use this software
       in a product, an acknowledgment in the product documentation would be
       appreciated but is not required.
    2. Altered source versions must be plainly marked as such, and must not be
       misrepresented as being the original software.
    3. This notice may not be removed or altered from any source distribution.
    (this is the zlib license)

  */

/*    gcc -O3 -m64 -Wall -mavx2 -march=haswell  expc.c -lm     */

#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5


__m256 exp256_ps(__m256 x) {

/*
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int r;
    int i;
    float p;
    static float xv[MAXSIZE];
    static float yv[MAXSIZE];
    float *xp;
    float *yp;

  for (r = 0; r < REIT; r++) {
    p = XXX;
    xp = xv;
    yp = yv;

    for (i = 0; i < MAXSIZE; i += 8) {

    __m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
    __m256 y = exp256_ps(x);

    _mm256_store_ps(xp,x);
    _mm256_store_ps(yp,y);

    xp += 8;
    yp += 8;
    p += 0.00008;
    }

  }

  for (i=0;i<MAXSIZE;i++){
      printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
  }

    return 0;
}


For comparison, this is the example code using expf(x) from the math library, single-threaded, float precision.

#include <stdio.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
/*    gcc -O3 -m64 exp_st.c -lm    */


int main(){
    int r;
    int i;
    float p;
    static float xv[MAXSIZE];
    static float yv[MAXSIZE];


  for (r = 0; r < REIT; r++) {
    p = XXX;

    for (i = 0; i < MAXSIZE; i++) {

      xv[i]=p;
      yv[i]=expf(p);

    p += 0.00001;
    }
  }

  for (i=0;i<MAXSIZE;i++){
      printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
  }

    return 0;
}

SOLUTION: As Andreas Wenzel explained, the gcc compiler is smart enough to decide that it is not necessary to actually write the results to the array, so those writes are optimized away. After new performance tests based on this information (earlier I had made several mistakes or assumed wrong facts), the results are clearer: exp (double arg), and expf (float arg) which is more than 2x faster than exp (double arg), have been improved, but they are not as fast as the AVX2 algorithm (x8 parallel float args), which is around 6x faster than the SSE2 algorithm (x4 parallel float args). Here are some results, scaling as expected for an Intel Hyper-Threading CPU, except for the SSE2 algorithm:

exp (double arg) single thread: 18 min 46 sec

exp (double arg) 4 threads: 5 min 4 sec

exp (double arg) 8 threads: 4 min 28 sec

expf (float arg) single thread: 7 min 32 sec

expf (float arg) 4 threads: 1 min 58 sec

expf (float arg) 8 threads: 1 min 41 sec

Relative error**:

i           x                     y = expf(x)           double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.518329620e+00   y = 8.059411496e-02   exp_dbl = 8.059411715e-02   rel_err =-2.716802480e-08
i = 372000  x =-1.278343201e+00   y = 2.784983218e-01   exp_dbl = 2.784983343e-01   rel_err =-4.490403948e-08
i = 496000  x =-3.867173195e-02   y = 9.620664716e-01   exp_dbl = 9.620664730e-01   rel_err =-1.481617428e-09
i = 620000  x = 1.201261759e+00   y = 3.324308872e+00   exp_dbl = 3.324308753e+00   rel_err = 3.571995830e-08
i = 744000  x = 2.441616058e+00   y = 1.149159718e+01   exp_dbl = 1.149159684e+01   rel_err = 2.955980805e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.921588898e+00   y = 1.372204742e+02   exp_dbl = 1.372204694e+02   rel_err = 3.563072184e-08
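
The relative-error columns can be reproduced with a small comparison against double-precision exp; a minimal sketch, assuming the float arrays from the codes above:

/* Hedged sketch: relative error of the float result yv[i] vs. double-precision exp. */
double exp_dbl = exp((double)xv[i]);
double rel_err = ((double)yv[i] - exp_dbl) / exp_dbl;
printf("i = %-7d x =%16.9e   y = %16.9e   exp_dbl = %16.9e   rel_err =%16.9e\n",
       i, xv[i], yv[i], exp_dbl, rel_err);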

*SSE2 algorithm by Julien Pommier: a 6.8x speed increase from one thread to 8 threads. My performance-test code passes a 16-byte-aligned (aligned(16)) union of a vector and a 4-float array to the library, instead of aligning the whole float array, which may cause lower performance, at least relative to the AVX2 code. Its improvement with multithreading also looks good for an Intel Hyper-Threading CPU, but at a lower speed, with run times between about 2.5x and 1.5x higher. Maybe the SSE2 code could be sped up with better array alignment, which I couldn't improve (a short alignment sketch follows the error tables below):

exp_ps (x4 parallel float arg) single thread: 12 min 7 sec

exp_ps (x4 parallel float arg) 4 threads: 3 min 10 sec

exp_ps (x4 parallel float arg) 8 threads: 1 min 46 sec

Relative error**:

i           x                     y = exp_ps(x)         double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.518329620e+00   y = 8.059412241e-02   exp_dbl = 8.059411715e-02   rel_err = 6.527768787e-08
i = 372000  x =-1.278343201e+00   y = 2.784983218e-01   exp_dbl = 2.784983343e-01   rel_err =-4.490403948e-08
i = 496000  x =-3.977407143e-02   y = 9.610065222e-01   exp_dbl = 9.610065335e-01   rel_err =-1.174323454e-08
i = 620000  x = 1.200158238e+00   y = 3.320642233e+00   exp_dbl = 3.320642334e+00   rel_err =-3.054731957e-08
i = 744000  x = 2.441616058e+00   y = 1.149159622e+01   exp_dbl = 1.149159684e+01   rel_err =-5.342903415e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.921588898e+00   y = 1.372204742e+02   exp_dbl = 1.372204694e+02   rel_err = 3.563072184e-08

AVX2 algorithm (x8 parallel float arg) single thread: 1 min 45 sec

AVX2 algorithm (x8 parallel float arg) 4 threads: 28 sec

AVX2 algorithm (x8 parallel float arg) 8 threads: 27 sec

Relative error**:

i           x                     y = exp256_ps(x)      double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.516632080e+00   y = 8.073104918e-02   exp_dbl = 8.073104510e-02   rel_err = 5.057888540e-08
i = 372000  x =-1.279417157e+00   y = 2.781994045e-01   exp_dbl = 2.781993997e-01   rel_err = 1.705288467e-08
i = 496000  x =-3.954863176e-02   y = 9.612231851e-01   exp_dbl = 9.612232069e-01   rel_err =-2.269774967e-08
i = 620000  x = 1.199879169e+00   y = 3.319715738e+00   exp_dbl = 3.319715775e+00   rel_err =-1.119642824e-08
i = 744000  x = 2.440370798e+00   y = 1.147729492e+01   exp_dbl = 1.147729571e+01   rel_err =-6.896860199e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.923286438e+00   y = 1.374535980e+02   exp_dbl = 1.374536045e+02   rel_err =-4.676466368e-08

**The relative errors are the same because the SSE2 and AVX2 codes use identical algorithms, and more than likely it is also the algorithm of the library function exp(x).
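
On the alignment note marked * above: _mm_store_ps expects a 16-byte-aligned address and _mm256_store_ps a 32-byte-aligned one, so aligning the whole float arrays (instead of a per-vector union) is enough to keep the aligned stores safe. A minimal sketch, mirroring the arrays used in the codes here rather than the exact test code:

/* Hedged sketch: whole arrays aligned for SSE (16-byte) or AVX (32-byte) stores. */
static float xv[MAXSIZE] __attribute__((aligned(32)));
static float yv[MAXSIZE] __attribute__((aligned(32)));
/* With i a multiple of 8:  _mm256_store_ps(&xv[i], x);
   or use _mm256_storeu_ps / _mm_storeu_ps, which accept unaligned addresses
   at a possible small cost. C11 _Alignas(32) is a portable alternative. */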

Source code for the multithreaded AVX2 algorithm:

/* Performance test of a multithreaded exp(x) algorithm

based on AVX implementation of Giovanni Garberoglio


 Copyright (C) 2020 Antonio R.


     AVX implementation of exp:
     Modified code from this source: https://github.com/reyoung/avx_mathfun
     Based on "sse_mathfun.h", by Julien Pommier
     http://gruntthepeon.free.fr/ssemath/
     Copyright (C) 2012 Giovanni Garberoglio
     Interdisciplinary Laboratory for Computational Science (LISC)
     Fondazione Bruno Kessler and University of Trento
     via Sommarive, 18
     I-38123 Trento (Italy)


    This software is provided 'as-is', without any express or implied
    warranty.  In no event will the authors be held liable for any damages
    arising from the use of this software.
    Permission is granted to anyone to use this software for any purpose,
    including commercial applications, and to alter it and redistribute it
    freely, subject to the following restrictions:
    1. The origin of this software must not be misrepresented; you must not
       claim that you wrote the original software. If you use this software
       in a product, an acknowledgment in the product documentation would be
       appreciated but is not required.
    2. Altered source versions must be plainly marked as such, and must not be
       misrepresented as being the original software.
    3. This notice may not be removed or altered from any source distribution.
    (this is the zlib license)

  */

  /*    gcc -O3 -m64 -mavx2 -march=haswell expc_multi.c -pthread -lm    */

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 4

typedef float  FLOAT32[MAXSIZE] __attribute__((aligned(32)));  /* 32-byte alignment required by _mm256_store_ps */
static FLOAT32 xv;
static FLOAT32 yv;



__m256 exp256_ps(__m256 x) {

/*
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}


void* run(void *received_Val){
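    /* Each thread computes its slice of xv[]/yv[] with 8-wide AVX vectors, repeated REIT times. */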
    int single_val = *((int *) received_Val);
    int r;
    int i;
    float p;
    float *xp;
    float *yp;

  for (r = 0; r < REIT; r++) {
    p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
    xp = xv + single_val*MAXSIZE/num_threads;
    yp = yv + single_val*MAXSIZE/num_threads;

    for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i += 8) {

      __m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
      __m256 y = exp256_ps(x);

      _mm256_store_ps(xp,x);
      _mm256_store_ps(yp,y);

      xp += 8;
      yp += 8;
      p += 0.00008;

    }
  }

    return 0;
}


int main(){
    int i;
      pthread_t tid[num_threads];


    for (i=0;i<num_threads;i++){
      int *arg = malloc(sizeof(*arg));
      if ( arg == NULL ) {
        fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
        exit(1);
      }

      *arg = i;
        pthread_create(&(tid[i]), NULL, run, arg);
    }

    for(i=0; i<num_threads; i++)
    {
        pthread_join(tid[i], NULL);
    }

    for (i=0;i<MAXSIZE;i++){
        printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
    }

    return 0;
}

Charts overview (image links):

exp (double arg) without the printf loop: not real performance. As Andreas Wenzel found, gcc doesn't actually calculate exp(x) when the results are not going to be printed; even the float version is much slower because of a few different assembly instructions. The graph may still be useful for an assembly algorithm that only uses low-level CPU caches/registers.

expf (float arg): real performance, i.e. with the printf loop.

AVX2 algorithm: the best performance.

Performance test (overview).

Antonio R.
  • What is the question? – Christian Gibbons Oct 15 '20 at 17:24
  • *I tried add free(tid) without success.* Since `tid` isn't allocated by `malloc()` or friends, you can't `free()` it... – Shawn Oct 15 '20 at 17:24
  • Please show your compilation command. With [GCC](http://gcc.gnu.org/) it might be `gcc -Wall -Wextra -O2 -g your-code.c -o your-prog`. Then use both [GDB](https://www.gnu.org/software/gdb/) and [valgrind](http://valgrind.org/) *after having read documentation* of GCC, of GDB, of valgrind – Basile Starynkevitch Oct 15 '20 at 17:26
  • Yes, I/O is expensive. Just how expensive depends on multiple factors, but large among them are the characteristics of the target device. – John Bollinger Oct 15 '20 at 17:32
  • ^^^^ ... and console output is dreadfully slow. Redirect those million `printf` calls to a file and I bet your runtime is considerably faster. – WhozCraig Oct 15 '20 at 17:47
  • I edited the post adding more information. – Antonio R. Oct 15 '20 at 18:00
  • Calling `printf` requires a mutex to be acquired, because the ISO C standard requires it to be thread-safe. See the following question on how to disable this: [Non-threadsafe file I/O in C/C++](https://stackoverflow.com/questions/1200247/non-threadsafe-file-i-o-in-c-c) Note that this question was asked at a time before C11 was released, which required I/O functions to be thread-safe. However, I don't think that this is the issue, because acquiring a mutex should cost about 70 CPU cycles (assuming no thread contention), which corresponds to about 20ms if done one million times. – Andreas Wenzel Oct 15 '20 at 20:27
  • Why do you think the performance is poor? What do you compare it with? `loop with calculations, stored in arrays` is not your actual code, so we cannot tell why your code might have bad performance. Please post a [mcve]. – n. m. could be an AI Oct 15 '20 at 21:31
  • Printing 1 million lines of text with a total size of 27MB using `fprintf` to a file takes about 1 second with me with full optimizations and 4 seconds for a debug build. I see no way that the code you posted could cause a program running time of "several minutes". Please post a [mre]. – Andreas Wenzel Oct 15 '20 at 21:54
  • Just as a side note: Instead of hard-coding the value `1000000` everywhere, your program would be easier to read if you used a `#define` constant, as you do with `num_threads`. This is especially true because the number `100000` you use in one place can easily be confused with the number `1000000`. – Andreas Wenzel Oct 16 '20 at 16:53
  • I doubt that your most recent edit is a [mre] of the problem you describe. Did you verify that it reproduces the problem before you posted it? When I run your program and redirect standard output to a file, removing the `printf` statements only saves about 3 seconds of the program running time (which is the expected amount of time required for writing 54MB of text data). Please provide a [mre] of both situations you describe, i.e. one for the situation in which your program runs properly and one for the situation in which it doesn't run properly. – Andreas Wenzel Oct 16 '20 at 18:18
  • @Andreas Wenzel: Your result is what I was expecting; maybe it is something related to other Ubuntu 18.04 software or the C version. I made several tests, with and without threads, without malloc, changing the number of loops and the array sizes, and with only one one-dimensional array... no changes, or even worse, on my Intel 4770 (4 real cores + 4 hyper-threading). For the code I posted, without the printf loop the running time is 12 seconds; with printf redirected to a file it is 4 min 30 sec. – Antonio R. Oct 16 '20 at 18:59
  • @AntonioR.: Hmmm, that is very strange, the total running time with me is about 9 minutes on my Intel i5-760 CPU (1st Generation Intel Core, yours is 4th Generation). It is 4 physical cores and 4 logical cores (no hyperthreading). Adding the `printf` statements only adds 3 seconds to the total time, so it is negligible. So it seems that I am experiencing your problem all the time, even without the `printf` statements. – Andreas Wenzel Oct 16 '20 at 19:06
  • @AntonioR.: There is a big difference between 12 seconds and 4 minutes. I suspect that the compiler may be optimizing away the outer loop, as it does the exact same thing 100,000 times. However, I cannot verify this without a [mre] of both test cases. – Andreas Wenzel Oct 16 '20 at 19:25
  • Ok, thanks, now I have the exact code of both test cases and will perform some tests. Can you specify the exact compiler version of gcc? You may want to try the clang compiler, to see if the results are similar. Also, can you look at the assembly instructions that the compiler emits, to see whether it optimized away the outer loop? Or don't you know assembly? – Andreas Wenzel Oct 16 '20 at 20:05
  • @Andreas Wenzel: Regardless of whether the code or the constants are changed, the printf loop always adds a lot of time. I have updated the post with the two codes, with and without the printf loop, and with #define constants. – Antonio R. Oct 16 '20 at 20:07
  • @Andreas Wenzel: gcc version 7.5.0. I have to take some time to check assembly and clang. – Antonio R. Oct 16 '20 at 20:13
  • You might want to output some diagnostic `fprintf` statements to `stderr` with timing information obtained with [`clock_gettime`](https://linux.die.net/man/3/clock_gettime) using the `CLOCK_MONOTONIC` parameter. Such a diagnostic output would be especially useful between the lines with `pthread_join` and the `for` loop with the `printf` statements. – Andreas Wenzel Oct 16 '20 at 20:15
  • Since the question now contains a [mre] for both test cases, I have voted to reopen the question. – Andreas Wenzel Oct 16 '20 at 20:29
  • @Andreas Wenzel: Because the problem is not related to pthread, I will post a new question. – Antonio R. Oct 17 '20 at 09:44
  • I believe I have solved the mystery. When you don't print the arrays at the end of the program, the compiler is smart enough to realize that it is not necessary to actually write the results to the array, because these results are never used. Therefore, these writes are optimized away by the compiler. Also, it seems to optimize away the call to the function `exp` under certain conditions, probably because that function call has no effect, unless the input number is invalid. This is visible from the assembly code, as you can see when you test it on this site: http://www.godbolt.org/ – Andreas Wenzel Oct 18 '20 at 05:47
  • @Andreas Wenzel: I saw what you mean; if I understood correctly, it is a bad compiler optimization, really very bad, or a missing library option (I'm not familiar with C). I found a possible solution: the problem is related to the calls to the math library together with printf; it seems that the exp function is affected in a special way. If exp is replaced with an algorithm, then printf only adds a few seconds, as expected. – Antonio R. Oct 18 '20 at 19:42
  • @AntonioR.: The problem is not that the compiler is "bad", but rather that the compiler was too smart for your test and took a short-cut, making your test meaningless. In other words, it outsmarted your test. The compiler saw that writing the results to the array had no observable effects, because you weren't printing them afterwards. Therefore, it considered most of your calculations redundant and optimized them out of your program. – Andreas Wenzel Oct 19 '20 at 01:37
  • The only reason why the compiler didn't optimize the nested loop out of your program completely was that calling the function `exp` has one observable side-effect: If that function is called with an invalid floating point value (such as a [NaN](https://en.wikipedia.org/wiki/NaN)), the function must raise a floating point exception. Therefore, the compiler optimized the loop in such a way that it checked if the floating point value was invalid, and only called the function `exp` if this was the case (so it never actually called it and it only performed the check). – Andreas Wenzel Oct 19 '20 at 01:44
  • If you want to test a program on how fast it can perform calculations, then you must force the compiler to print the results or the results must have some other observable effect (for example they must be written to a file or sent over a network). Otherwise, the compiler may notice that the results have no effect. In that case, it won't calculate them in the first place. Instead, it will optimize them away. – Andreas Wenzel Oct 19 '20 at 02:10
  • @Andreas Wenzel: I'm not familiar with C, but about cache misses or smart gcc optimizations I think "neither yes, nor no, but quite the opposite", a Spanish locution with no English translation. After some tests I think the exp function is processor-resource intensive, probably affected by the per-core cache or the SIMD resources implemented by the processor generation. – Antonio R. Oct 19 '20 at 12:02
  • @AntonioR.: I suggest that you run your program instruction by instruction in a debugger in order to see whether it is calling your function `exp` at all if you remove the `printf` lines. My guess is that it is not. However, in order to do this, you must set your debugger to assembly mode, so that you see and execute the individual CPU instructions. – Andreas Wenzel Oct 19 '20 at 12:10
  • @AntonioR.: Ah, I see that you are now calling a different `exp` function. Therefore, the compiler may no longer be aware that the function has no observable side-effects and will therefore be unable to optimize it away, even if the input is valid. That is probably why it no longer makes a difference whether you print the results or not. – Andreas Wenzel Oct 19 '20 at 15:16
  • @Andreas Wenzel: The new exp function is based on an AVX2 algorithm; in my tests it is about as fast as exp(x) from math.h, only a little slower, although it does not show differences when the printf loop is added. But if the AVX2 algorithm code becomes multithreaded, then it is slower with the printf loop, while exp(x) of math.h with the printf loop is always slow, with or without multithreading. In my opinion exp(x) of math.h makes intensive use of processor resources, even with a single thread, which is a bottleneck for other threads, at least on the Intel 4770 (Haswell). – Antonio R. Oct 20 '20 at 08:36
  • @AntonioR.: The main effect of `printf` is that it forces the compiler to actually perform the calculations. If you omit these `printf` statements, the compiler may be able to optimize away most of the calculations, as it has done at least with the `exp` function in `math.h`. Therefore, if you want to compare the speed of different implementations, you must always make sure that the results are printed in some way or that the results have some other kind of observable effect. Anything else will not be a fair comparison. – Andreas Wenzel Oct 20 '20 at 09:00
  • @AntonioR.: You don't have to print the entire result array, though (which is several megabytes). It is enough if you print one number which is the final result. This final result must be dependent on all the other results, though (for example the sum of all results). If the final result is not dependent on all the other results, the compiler may notice this and optimize away the redundant calculations. – Andreas Wenzel Oct 20 '20 at 09:02
  • @Andreas Wenzel: I know what you say about compiler optimizations, but that doesn't explain everything. The AVX algorithm is basically the same calculation as exp single thread, and the performance is as expected with or without the printf loop over the arrays. The AVX2 algorithm is as fast as exp from the math library; it works as expected, or it is in line with my previous tests, but it is only float precision, with worse speed and accuracy. Anyway, I don't think it is relevant to investigate a problem dependent on the specific hardware, though it would be interesting if tested with the latest Intel and AMD CPUs, and Xeon and Epyc. – Antonio R. Oct 24 '20 at 07:45
  • I'm afraid that I am unable to read your 3D-graphs. They look very fancy, but the lines are so thick that I cannot read the values that they represent. – Andreas Wenzel Oct 26 '20 at 14:37

2 Answers


When you don't print the arrays at the end of the program, the gcc compiler is smart enough to realize that the results of the calculations have no observable effects. Therefore, the compiler decides that it is not necessary to actually write the results to the array, because these results are never used. Instead, these writes are optimized away by the compiler.

Also, when you don't print the results, the library function exp has no observable effects, provided it is not called with input that is so high that it would cause a floating-point overflow (which would cause the function to raise a floating point exception). This allows the compiler to optimize these function calls away, too.

As you can see in the assembly instructions emitted by the gcc compiler for your code which does not print the results, the compiled program doesn't unconditionally call the function exp, but instead tests if the input to the function exp would be higher than 7.09e2 (to ensure that no overflow will occur). Only if an overflow would occur, will the program jump to the code which calls the function exp. Here is the relevant assembly code:

ucomisd xmm1, xmm3
jnb .L9

In the above assembly code, the CPU register xmm3 contains the double-precision floating-point value 7.09e2. If the input is higher than this constant, the function exp would cause a floating-point overflow, because the result cannot be represented in a double-precision floating-point value.

Since your input is always valid and low enough to not cause a floating-point overflow, your program will never perform this jump, so it will never actually call the function exp.
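
To see this locally (the exact output depends on the compiler version; the comments also point to godbolt.org), the generated assembly can be inspected with, for example:

gcc -O3 -m64 -S -masm=intel exp_multi.c -pthread

With -fno-math-errno (or -ffast-math), gcc no longer has to preserve the errno side effect of exp, so it may be able to drop the call and the overflow check entirely when the results are unused.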

This explains why your code is so much faster when you do not print the results. If you do not print the results, your compiler will determine that the calculations have no observable effects, so it will optimize them away.

Therefore, if you want the compiler to actually perform the calculations, you must ensure that the calculations have some observable effect. This does not mean that you have to actually print all results (which are several megabytes large). It is sufficient if you just print one line which is dependent on all the results (for example the sum of all results).
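
For example, a minimal sketch of such an observable final result, assuming the xv and yv arrays from the question:

/* Hedged sketch: make every result observable without printing millions of lines. */
double checksum = 0.0;
for (int i = 0; i < MAXSIZE; i++)
    checksum += xv[i] + yv[i];
printf("checksum = %.17g\n", checksum);

Because the single printed value depends on every element, the compiler can no longer discard the stores or the calls to exp as dead code.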

However, if you replace the function call to the library function exp with a call to some other custom function, then, at least in my tests, the compiler is not smart enough to realize that the function call has no observable effects. In that case, it is unable to optimize the function call away, even if you do not print the results of the calculations.

For the reasons stated above, if you want to compare the performance of both functions, you must ensure that the calculations actually take place, by making sure that the results have an observable effect. Otherwise, you run the risk of the compiler optimizing away at least some of the calculations, and the comparison would not be fair.

Andreas Wenzel
  • I'm not sure that is the solution. I made various performance tests and exp(x) from math always gets similar speed to the AVX2 algorithm I posted above, or compared with the performance test from [link](http://gruntthepeon.free.fr/ssemath/sse_mathfun_test.c). This AVX2 algorithm has no printf performance problems, but only in single-thread mode. So I "suspect" the main problem is the CPU cache, because the gcc compiler adds a lot of assembly code for the exp function of the math library, while the AVX2 algorithm is much simpler and the CPU cache would speed it up. – Antonio R. Oct 23 '20 at 18:40
  • @AntonioR.: Can you please post your code which uses the AVX2 version with multiple threads? I have a suspicion what could be wrong, but I cannot confirm it unless I see the code. By the way, instead of posting the whole code for multiple similar versions, you might want to use conditional compilation, for example `#ifdef PRINT_RESULTS`, then the conditional code and then `#endif`. That way, you won't have to post your whole program so many times. In that case, all you have to do is specify exactly which preprocessor definition you changed. That way, your question will be shorter. – Andreas Wenzel Oct 23 '20 at 23:00
  • I don't have the multithreaded version; instead I tested a very simple bash script that launched multiple single-thread instances, and it printed without many delays, so I made a mistake due to the various tests I did. Well, really I had a multithreaded code, and a fork version so that all the parallel worlds would be in the competition, but it throws a segfault due to array alignment, which I'm investigating. More relevant is the C code at the above link: gruntthepeon.free.fr/ssemath/sse_mathfun_test.c, a performance test for math.h; that's why I say exp of math (single thread) should be as fast as the AVX2 algorithm. – Antonio R. Oct 24 '20 at 13:38
  • @AntonioR.: The point of using multithreading is to divide the data among several threads. If you call the same program several times in a bash script, then the overall data to be processed will not be divided among the threads, unless you change the program to reduce the amount of data every program must process. Did you take this into account? – Andreas Wenzel Oct 24 '20 at 18:40
  • I reduced the array size; although it is not a reliable test, the speed increases very well up to 4x, as should be normal with a 4-core Intel Hyper-Threading CPU, and not 8x like the pthread code without the printf loop. On the other hand, there is the question of the test result I posted for my processor from the code at the above link: expf vs the other SSE2 algorithm. Inside the C source you can see results for some other CPUs. Maybe SSE2 is not as fast as I supposed. – Antonio R. Oct 26 '20 at 01:23
  • After new performance tests I made based on new information (before, I made several mistakes or assumed wrong facts), the results seem clearer: exp (double arg), or expf (float arg) which is 2x+ faster than exp (double arg), have been improved, but it is not as fast as the AVX2 algorithm (x8 parallel float arg) = x8 SSE2 algorithm (float arg). I just need to corroborate the results when the pthread version of the AVX2 algorithm is fixed. – Antonio R. Oct 26 '20 at 11:32

I don't think this has much to do with pthreads, because your code only calls printf after the threads are joined. Instead, the poor performance is likely due to cache misses from needing to read the xv and yv arrays in every iteration of the print loop.
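
Assuming Linux perf is available, one way to check the cache-miss hypothesis is to count misses while discarding the output, for example:

perf stat -e cache-references,cache-misses ./a.out > /dev/null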

resiliware
  • Why should reading from two arrays at the same time in a sequential manner lead to cache misses? In this case, it should be very easy for the [hardware cache prefetcher](https://en.wikipedia.org/wiki/Cache_prefetching) to prevent cache misses. – Andreas Wenzel Oct 15 '20 at 21:56
  • As with all performance questions, it is highly dependent on the specific hardware and software configuration on which the code is run. Reading from large arrays in this manner can be the source of performance problems. Or the answer may be that there is no performance problem at all. – resiliware Oct 16 '20 at 01:53
  • I think that is the problem. After some tests, with a single-thread version the performance is even worse, while the math-only single-thread version takes about 8x the time of the math-only 8-thread version. – Antonio R. Oct 16 '20 at 10:53
  • After some extensive testing and studying the assembler instructions emitted by the compiler, I have found out that the problem has nothing to do with cache misses, but rather that the compiler was optimizing lots of stuff away, because the results were never used if they are never printed afterwards. See the main comments section for further information. – Andreas Wenzel Oct 18 '20 at 06:21
  • It doesn't seem that the principal problem is cache misses. I found a possible solution: the problem is related to the calls to the math library together with printf; it seems that the exp function is affected in a special way. If exp is replaced with an algorithm, then printf only adds a few seconds, as expected. – Antonio R. Oct 18 '20 at 19:31
  • After some tests I think the exp function is processor-resource intensive, probably affected by the per-core cache or the SIMD resources implemented by the processor generation. – Antonio R. Oct 19 '20 at 11:53