-3

I'm writing a simple program and I want to measure its execution time on Windows and Linux (both 64-bit). I have a problem, because for 1 000 000 elements in the table it takes about 35 seconds on Windows, while on Linux it takes about 30 seconds for just 10 elements. Why is the difference so huge? What am I doing wrong? Is there something in my code that does not work properly on Linux?

Here is my code:

#include <cstdlib>   // srand, rand
#include <ctime>     // time, clock, CLOCKS_PER_SEC
#include <cmath>     // round
#include <iostream>  // cout, endl
using namespace std;

void fillTable(int s, int t[])
{
    srand(time(0));
    for (int i = 0; i < s; i++)
    {
        t[i] = rand();
    }
}
void checkIfIsPrimeNotParalleled(int size, int table[])
{
    for (int i = 0; i < size; i++)
    {
        int tmp = table[i];

        if (tmp < 2)
        {
        }


        for (int i = 2; i < tmp; i++)
        {
            if (tmp % i == 0)
            {
            }
            else
            {
            }
        }
    }
}
void mesureTime(int size, int table[], int numberOfRepetitions)
{
    long long sum = 0;
    clock_t start_time, end_time;
    fillTable(size, table);

    for (int i = 0; i < numberOfRepetitions; i++)
    {
        start_time = clock();

        checkIfIsPrimeNotParalleled(size, table); 

        end_time = clock();
        double duration = (end_time - start_time) / CLOCKS_PER_SEC;
        sum += duration;
    }
    cout << "Avg: " << round(sum / numberOfRepetitions) << " s"<<endl;
}

int main()
{

    static constexpr int size = 1000000; 
    int *table = new int[size];
    int numberOfRepetitions = 1;
    mesureTime(size, table, numberOfRepetitions);
    delete[] table;
    return 0;

}

and here is the makefile for Linux. On Windows I'm using Visual Studio 2015.

.PHONY: all clean mrproper

CXX = g++
EXEC = tablut
LDFLAGS = -fopenmp
CXXFLAGS = -std=c++11 -Wall -Wextra -fopenmp -m64
SRC= Project1.cpp
OBJ= $(SRC:.cpp=.o)

all: $(EXEC)

tablut: $(OBJ)
    $(CXX) -o tablut $(OBJ) $(LDFLAGS)

%.o: %.cpp
    $(CXX) -o $@ -c $< $(CXXFLAGS) 

clean:
    rm -rf *.o

mrproper: clean
    rm -rf tablut

The main goal is to measure the time.

karo96
  • Any reason you don't let GCC optimize this code? (Like the -O3 option?) – JVApen Jun 11 '17 at 15:48
  • Are you running the code in Release mode under Windows? – Rakete1111 Jun 11 '17 at 15:49
  • When I run this project with -O3, -O2 or even -O1, the time for 99999999 elements is 0 seconds – karo96 Jun 11 '17 at 15:50
  • Your function under test has no effect, so the call to the function can be optimized away. – mch Jun 11 '17 at 15:51
  • Micro-benchmarking is a little bit harder than just slapping a loop around some function. See Chandler Carruth's talk on YouTube for example. – Baum mit Augen Jun 11 '17 at 15:52
  • Possible duplicate of [Why is this C++ code execution so slow compared to java?](https://stackoverflow.com/questions/44342884/why-is-this-c-code-execution-so-slow-compared-to-java) – selbie Jun 11 '17 at 15:59

2 Answers

2

You are not building with optimization enabled on Linux. Add -O2 or -O3 to your compiler flags (CXXFLAGS) and you'll see a significant performance improvement.
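
For example (a minimal change to the question's makefile; -O2 would work just as well):

CXXFLAGS = -std=c++11 -Wall -Wextra -fopenmp -m64 -O3

And, as the comments point out, make sure the Windows build is a Release (optimized) build so the two platforms are compared fairly.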

Jesper Juhl
1

Your code has a for loop set to 1,000,000 iterations. As noted by others, the compiler can optimize this loop away, such that you learn nothing.

A technique I use to work around the good-compiler issue is to replace the fixed-count loop with a low-cost time check.

In the following code snippet, I use chrono for duration measurements and time(0) to check for end-of-test. Chrono is not the lowest-cost time check I have found, but I think it is good enough for how I am using it here. std::time(0) measures at about 5 ns on my system, about the fastest I have measured.
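
(For reference: the chrono typedefs used below are not defined in this snippet - see the last comment under this answer for where they live. Judging from how they are used, they are presumably along these lines; this is a guess for illustration, not the actual definitions:)

#include <chrono>

typedef std::chrono::high_resolution_clock HRClk_t; // clock used for the measurement
typedef HRClk_t::time_point                Time_t;  // a point in time from that clock
typedef std::chrono::microseconds          US_t;    // microsecond duration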

// Note 7 - semaphore function performance
// measure duration when no thread 'collision' and no context switch
// (PPLSem_t, dashLine and report() are defined elsewhere - see the comments below)
void measure_LockUnlock()
{
   PPLSem_t*  sem1 = new PPLSem_t;
   assert(nullptr != sem1);
   size_t     count1 = 0;
   size_t     count2 = 0;
   std::cout << dashLine << "  3 second measure of lock()/unlock()"
             << " (no collision) " << std::endl;
   time_t t0 = time(0) + 3;          // end-of-test: run for about 3 seconds

   Time_t start_us = HRClk_t::now();
   do {
      assert(0 == sem1->lock());   count1 += 1;
      assert(0 == sem1->unlock()); count2 += 1;
      if(time(0) > t0)  break;
   } while(1);
   auto duration_us = std::chrono::duration_cast<US_t>(HRClk_t::now() - start_us);

   assert(count1 == count2);
   std::cout << report(" 'sem lock()+unlock()' ", count1, duration_us.count());

   delete sem1;
   std::cout << "\n";
} // measure_LockUnlock()

FYI - "class PPLSem_t" is four single-line methods wrapping a POSIX process semaphore set to local mode (unnamed, unshared).

The test above measures only the cost of the method invocations; no context switches (which are notoriously slow) occur in this experiment.


But wait, you say ... don't lock() and unlock() have side effects? Agreed, they do - and the compiler has to assume so, which is exactly why it cannot remove the calls or the loop around them.

So how do you make this useful?

Two steps. 1) Measure your lock/unlock performance. 2) Add the code from inside your for loop (not the for loop itself) into this lock/unlock loop, then measure the performance again.

The difference between these two measurements is the information you seek, and I think the compiler cannot optimize it away.

The result of the duration measurement on my older Dell, with Ubuntu 15.10, g++ v5.2.1.23, and -O3, is:

  --------------------------------------------------------------
  3 second measure of lock()/unlock() (no collision) 
  133.5317660 M 'sem lock()+unlock()'  events in 3,341,520 us
  39.96138464 M 'sem lock()+unlock()'  events per second
  25.02415792 n sec per  'sem lock()+unlock()'  event 

So that is about 12.5 ns for each individual method call, and about 133 * 10^6 iterations in roughly 3 seconds.

You can attempt to adjust the run time to reach 1,000,000 iterations, or simply use the iteration count to jump out of the loop (an "if (count1 == 1000000) break;" kind of idea).
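
A minimal sketch of that count-based exit, adapting the loop above (the 1,000,000 constant simply mirrors the question's table size):

   do {
      assert(0 == sem1->lock());   count1 += 1;
      assert(0 == sem1->unlock()); count2 += 1;
      if (count1 == 1000000) break;   // stop after a fixed number of iterations
   } while(1);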


Your assignment, should you choose to accept it, is to find a suitable, simple and fast method (or two) with a side effect (one which you know will not actually occur), add your code into that loop, and then run until the loop count reaches 1,000,000.
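
For illustration only, here is a minimal, self-contained sketch of that idea, under a few assumptions of mine: a plain POSIX sem_t stands in for PPLSem_t, a real trial-division check stands in for the body of the question's inner for loop, and the function name measureWithWork is made up for this example (link with -pthread if your glibc needs it):

#include <semaphore.h>
#include <chrono>
#include <ctime>
#include <iostream>

long long measureWithWork(const int* table, int size)
{
   sem_t sem;
   sem_init(&sem, 0, 1);                  // unnamed, process-local semaphore

   long long iterations  = 0;
   long long primesFound = 0;             // printed later, so the work stays observable
   time_t t0 = std::time(0) + 3;          // run for roughly 3 seconds
   auto start = std::chrono::high_resolution_clock::now();

   do {
      sem_wait(&sem);                     // lock  - a call the compiler must keep
      int tmp = table[iterations % size];
      bool prime = (tmp >= 2);
      for (int d = 2; d < tmp; ++d)       // the body of the question's inner loop
         if (tmp % d == 0) { prime = false; break; }
      if (prime) ++primesFound;
      sem_post(&sem);                     // unlock
      ++iterations;
   } while (std::time(0) <= t0);

   auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::high_resolution_clock::now() - start).count();
   std::cout << iterations << " iterations, " << primesFound
             << " primes, in " << us << " us\n";

   sem_destroy(&sem);
   return iterations;
}

Running the same loop again with the table/trial-division lines removed gives the step 1 baseline; subtracting the two per-iteration times isolates the cost of the work itself.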


Hope this helps.

2785528
  • umm, maybe this is too much help, but my PPLSem_t code exists in one of my other SE answers. – 2785528 Jun 11 '17 at 17:30
  • You might try a timed loop without using a method with a side effect ... it works because the compiler cannot guess your system's performance. – 2785528 Jun 11 '17 at 18:04
  • oops ... noticed my chrono typedefs are missing. See my answer here: https://stackoverflow.com/a/44467595/2785528 for a copy – 2785528 Jun 12 '17 at 02:27