
I am doing some Monte Carlo simulations with the following code.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <time.h>

#define L 20
#define N 200000


int main() {
    struct timespec ts_start, ts_end;
    clock_t clk_start, clk_end;
    struct rusage usage;
    struct timeval tv_ustart, tv_uend;
    struct timeval tv_sstart, tv_send;

    clock_gettime(CLOCK_MONOTONIC, &ts_start);
    clk_start = clock();
    getrusage(RUSAGE_SELF, &usage);
    tv_ustart = usage.ru_utime;
    tv_sstart = usage.ru_stime;

    /* 
     * the runtime of one simulation may vary a lot, so let's repeat it many times.
     */
    for (int iteration = 0; iteration < N; iteration += 1) {
        int node_num_by_depth[L + 1] = {1};

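        /*
         * Each parent at a given depth spawns 1 or 2 children.  A parent
         * that branches into two children pushes them one depth level
         * deeper; walking the depths from high to low keeps those
         * promoted children out of the current pass.
         */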
        for (int level = 0; level < L; level += 1) {
            for (int depth = level; depth != -1; depth -= 1) {
                int parent_num = node_num_by_depth[depth];
                node_num_by_depth[depth] = 0;
                for (int parent = 0; parent < parent_num; parent += 1) {
                    int child_num = 1 + arc4random() % 2;
                    int child_depth = depth + (child_num > 1);
                    node_num_by_depth[child_depth] += child_num;
                }
            }
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_end);
    clk_end = clock();
    getrusage(RUSAGE_SELF, &usage);
    tv_uend = usage.ru_utime;
    tv_send = usage.ru_stime;


    double elapsed = (double) (ts_end.tv_sec - ts_start.tv_sec)
                     + (double) (ts_end.tv_nsec - ts_start.tv_nsec) / 1E9;
    printf("Wall clock time elapsed: %g\n", elapsed);

    double cpu_time_used = ((double) (clk_end - clk_start)) / CLOCKS_PER_SEC;
    printf("CPU time elapsed: %g\n", cpu_time_used);

    /* Normalize the microsecond field so the fractional part never
     * prints as a negative or wrapped value. */
    long user_sec = tv_uend.tv_sec - tv_ustart.tv_sec;
    long user_usec = tv_uend.tv_usec - tv_ustart.tv_usec;
    if (user_usec < 0) { user_sec -= 1; user_usec += 1000000; }
    printf("User CPU time elapsed: %ld.%06ld\n", user_sec, user_usec);

    long sys_sec = tv_send.tv_sec - tv_sstart.tv_sec;
    long sys_usec = tv_send.tv_usec - tv_sstart.tv_usec;
    if (sys_usec < 0) { sys_sec -= 1; sys_usec += 1000000; }
    printf("System CPU time elapsed: %ld.%06ld\n", sys_sec, sys_usec);
}

When running on my MacBook Pro (2.3 GHz dual-core Intel Core i5, Turbo Boost up to 3.6 GHz, 64 MB of eDRAM), the output is

Wall clock time elapsed: 32.408
CPU time elapsed: 32.3566
User CPU time elapsed: 32.319151
System CPU time elapsed: 32.319114

When running on my FreeBSD server with an Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz (2499.96-MHz K8-class CPU), it gives me

Wall clock time elapsed: 396.563
CPU time elapsed: 396.414
User CPU time elapsed: 158.377633
System CPU time elapsed: 158.376629

In both cases, the compilation command I used is `clang main.c -Ofast -march=native -o main`.

Initially, the server was running Ubuntu 18.04, but I switched to FreeBSD after realizing that my program ran ridiculously slowly. However, even on FreeBSD (where `arc4random` is provided natively), the program is still significantly slower than on my MacBook Pro. Now, unless there is a very good reason, I don't want to switch to a new OS again.

I have profiled the program with gprof, and you can find the result here. We can see `arc4random` is taking quite a lot of time (30.8%), but that alone can't explain a tenfold performance penalty.
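One idea I might try before swapping the RNG entirely is to fetch entropy in bulk with `arc4random_buf` (provided natively on both macOS and FreeBSD) and consume one byte per draw, amortizing the per-call overhead. This is only a minimal sketch of the idea; the 4096-byte batch size and the helper name `random_child_num` are my own choices:

#include <stdint.h>
#include <stdlib.h>

/* Refill a byte buffer with arc4random_buf once per 4096 draws,
 * instead of paying the arc4random call overhead on every draw. */
static int random_child_num(void) {
    static uint8_t buf[4096];
    static size_t next = sizeof buf;

    if (next == sizeof buf) {      /* buffer exhausted: refill it */
        arc4random_buf(buf, sizeof buf);
        next = 0;
    }
    /* One fair bit per byte: same distribution as 1 + arc4random() % 2. */
    return 1 + (buf[next++] & 1);
}

The inner loop would then read int child_num = random_child_num(); instead of calling arc4random directly.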


I have added some other time measurements, as pointed out in the comments. You can see there is a huge gap between the wall clock time and the CPU time. Can I narrow it? Anyway, what I care about is how long my program takes to execute, not how many CPU cycles it consumes.
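To tell whether such a gap comes from the process being switched out rather than from the work itself, `getrusage` (which the program already calls) also reports context-switch counts. Here is a minimal sketch of a helper that prints them; `ru_nvcsw` and `ru_nivcsw` are standard struct rusage members:

#include <stdio.h>
#include <sys/resource.h>

/* Voluntary switches usually mean the process blocked (I/O, locks);
 * involuntary switches mean the scheduler preempted it in favor of
 * another runnable task. */
static void print_context_switches(void) {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        printf("Voluntary context switches:   %ld\n", usage.ru_nvcsw);
        printf("Involuntary context switches: %ld\n", usage.ru_nivcsw);
    }
}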


Here is some extended information about my MacBook Pro:

Hardware Overview:

  Model Name: MacBook Pro
  Model Identifier: MacBookPro14,1
  Processor Name: Intel Core i5
  Processor Speed: 2.3 GHz
  Number of Processors: 1
  Total Number of Cores: 2
  L2 Cache (per Core): 256 KB
  L3 Cache: 4 MB
  Memory: 8 GB
  Boot ROM Version: 184.0.0.0.0
  SMC Version (system): 2.43f6
  Serial Number (system): XXXXXXXXXXXX
  Hardware UUID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

My server is equipped with twelve Intel Skylake Xeon Platinum 8163 2.5 GHz CPUs.



I have a big finding! After replacing `arc4random` with `random`, there is a huge performance gain on the server (a sketch of the change follows the timings below).

  • MacBook Pro with `random`

    Wall clock time elapsed: 5.61316
    CPU time elapsed: 5.40623
    User CPU time elapsed: 5.361931
    System CPU time elapsed: 5.359096
    
  • Server with `random`

    Wall clock time elapsed: 6.69183
    CPU time elapsed: 6.6875
    User CPU time elapsed: 6.690042
    System CPU time elapsed: 6.689085
    
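For reference, here is a minimal sketch of the change; `random` is not cryptographically secure, which is fine for a Monte Carlo simulation, and the `time(NULL)` seed is just the simplest choice:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Seed the PRNG once at startup. */
    srandom((unsigned) time(NULL));

    /* The only change to the simulation's inner loop: draw the child
     * count from random() instead of arc4random(). */
    for (int i = 0; i < 10; i += 1) {
        int child_num = 1 + random() % 2;   /* 1 or 2 children */
        printf("%d ", child_num);
    }
    printf("\n");
    return 0;
}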
  • Isn't `clock_gettime` measuring the real time consumed by *all* of the processes that are running in parallel? You want to measure the processor time consumed by your process only (such as https://linux.die.net/man/3/clock) – Eugene Sh. Mar 20 '19 at 15:19
  • So, instead you should use `getrusage()` (or use `time` or `times`) – wildplasser Mar 20 '19 at 15:26
  • Have a look here: https://stackoverflow.com/questions/7215764/how-to-measure-the-actual-execution-time-of-a-c-program-under-linux – Jonny Schubert Mar 20 '19 at 15:26
  • Servers run *many* processes in parallel. Does it explain the difference? – Eugene Sh. Mar 20 '19 at 16:39
@EugeneSh. Eh... not really? This server is owned by me and me only, so I can confirm that only this program was running during that period, if we ignore daemons and other stuff that takes ~0.1% of the CPU time. Also, my server has 12 cores, so it could use the remaining 11 cores to do other things. – nalzok Mar 20 '19 at 16:43
  • This `volatile` looks suspect to me. – wildplasser Mar 20 '19 at 16:55
@wildplasser I *could* remove that `volatile` and print the stuff in `node_num_by_depth` instead, as I did before the edit, but doing so would fill the screen with meaningless numbers irrelevant to my question. I would say the sole purpose of this `volatile` is to prevent the compiler from optimizing everything out (as these computations have no side effects). – nalzok Mar 20 '19 at 16:59
  • The references to the array are aliased, so the compiler should know. – wildplasser Mar 20 '19 at 17:01
@wildplasser I have no idea what "aliased" means, but I assume you are saying it's safe to remove the `volatile` qualifier? Let me remove it and see if it makes any difference. – nalzok Mar 20 '19 at 17:04
  • Oh, and the machines could have different caches, and the compilers could align things differently. – wildplasser Mar 20 '19 at 17:08
  • @wildplasser Yeah that was my first suspect too! But after checking the specification, I realized my server has much larger caches than my MacBook Pro. I have edited the question to include more information. – nalzok Mar 20 '19 at 17:26
Could it be that your MacBook Pro is running with larger pages? You could be experiencing TLB thrashing if this is the case. Also, if the cache hierarchy in the MacBook is much simpler, you could be experiencing the accumulated effect of the various misses on the extra cache levels. Since your memory access is mostly random over a large memory range, your caches must be ineffective most of the time; the more cache levels, the longer it takes to hit main memory. – BlueStrat Mar 20 '19 at 19:17
Rewording my comment: you could be experiencing TLB thrashing IN YOUR SERVER if the virtual page size is configured to be smaller than on your MacBook. – BlueStrat Mar 20 '19 at 19:52
@BlueStrat Paging could be a factor, but from my **Edit III** you can tell it didn't make much of a difference. It could explain the remaining 0.2x slowdown, though. (BTW: how can I get info about the TLB? It appears to be a rarely-mentioned piece of hardware...) – nalzok Mar 21 '19 at 00:26
  • Glad to read that you found the bottleneck. Here is an excellent [series of articles](https://lwn.net/Articles/374424/) describing at great length how the TLB works, how it can be profiled, and how using larger pages reduces TLB pressure at the cost of less granularity and more expensive virtual memory paging. – BlueStrat Mar 21 '19 at 01:12

0 Answers