
The program below calculates the vector-vector dot product with sequential, CPU-parallel (OpenMP), and GPU-parallel (CUDA) implementations. The following code segments show how each of these functions is invoked and how the elapsed time is calculated.

#define SEQUENTIAL          "-s"
#define PARALLEL            "-p"
#define CUDA                "-c"
#define VERIFY              "-v"
#define TEST_AND_COMPARE    "-t"

#define GET_TIME(x); if (clock_gettime(CLOCK_MONOTONIC, &(x)) < 0)  {   perror("clock_gettime( ):");exit(EXIT_FAILURE);}

int main(int argc, char **argv) {

    struct timespec t1, t2, t3, t4;
    unsigned long sec, nsec;
    float comp_time;

    //invoking the sequential version
    if (!strcmp(argv[1], SEQUENTIAL)) {
        GET_TIME(t1);
        sequentialVersion();
        GET_TIME(t2);
        comp_time = elapsed_time_msec(&t1, &t2, &sec, &nsec);
        printf("N=%d: Time(ms)=%.5f \n", N, comp_time);
    }

    //invoking the parallel version
    else if (!strcmp(argv[1], PARALLEL)) {
        noOfThreads = atoi(argv[2]);
        GET_TIME(t1);
        parallelVersion();
        GET_TIME(t2);
        comp_time = elapsed_time_msec(&t1, &t2, &sec, &nsec);
        printf("N=%d: Threads=%d: Time(ms)=%.5f \n", N, noOfThreads,
                comp_time);
    }

    //the cuda invoke goes here...

    //comparing the answers received by each method of calculation
    else if (!strcmp(argv[1], TEST_AND_COMPARE)) {

        precision answer1, answer2, answer3;

        GET_TIME(t1);
        answer1 = sequentialVersion();
        GET_TIME(t2);
        comp_time = elapsed_time_msec(&t1, &t2, &sec, &nsec);
        printf("%-10s\tN=%d: Ans=%f: Time(ms)=%.5f \n", "Serial", N, answer1, comp_time);

        noOfThreads = atoi(argv[2]);
        GET_TIME(t3);
        answer2 = parallelVersion();
        GET_TIME(t4);
        comp_time = elapsed_time_msec(&t3, &t4, &sec, &nsec);
        printf("%-10s\tN=%d: Ans=%f: Time(ms)=%.5f Threads=%d \n", "Parallel",  N, answer2, comp_time, noOfThreads);
    }
}

float elapsed_time_msec(struct timespec *begin, struct timespec *end,
        unsigned long *sec, unsigned long *nsec) {
    if (end->tv_nsec < begin->tv_nsec) {
        *nsec = 1000000000 - (begin->tv_nsec - end->tv_nsec);
        *sec = end->tv_sec - begin->tv_sec - 1;
    } else {
        *nsec = end->tv_nsec - begin->tv_nsec;
        *sec = end->tv_sec - begin->tv_sec;
    }
    return (float) (*sec) * 1000 + ((float) (*nsec)) / 1000000;
}

The Makefile for the above program is as follows.

#specifying single or double precision
ifeq ($(double),)
    precision= 
else
    precision=-D USE_DOUBLES
endif

#specifying the problem size
ifeq ($(N),)
    problem-size=-D PROBLEM_SIZE=1000000
else
    problem-size=-D PROBLEM_SIZE=${N}
endif

dot:
    nvcc dot-product.cu -arch compute_11 -Xcompiler -fopenmp -O3 $(problem-size) $(precision) -o prog

The code is compiled with `make dot` using the default N. When it is run with `./prog -s`, the output is:

`N=1000000: Time(ms)=0.00010`

But with the same N, when the program is run with `./prog -t 6`, the serial time shows the expected behaviour (a few milliseconds), as shown below:

Serial      N=1000000: Ans=2249052.500000: Time(ms)=2.19174 
Parallel    N=1000000: Ans=2248955.500000: Time(ms)=0.53915 Threads=6 
Cuda        N=1000000: Ans=2248959.750000: Time(ms)=0.09935 

Why is it behaving like this?

  • "`#define GET_TIME(x); ...`" Please revisit your C book. This does not do what you think. The `if` statement inside is also a bad idea. Don't use a macro; a function (possibly `inline`) will do the same job. – too honest for this site Jan 01 '16 at 20:51
  • I don't see anything related to CUDA in this question. Or any CUDA code. Why is this tagged with the CUDA tag? And do you have a problem with your Makefile? If not, why is this tagged with makefile? – talonmies Jan 01 '16 at 22:49
  • @talonmies Removed the cuda tag – Rajith Gun Hewage Jan 02 '16 at 03:36

1 Answer

Although it would be better if you provided a complete code, I believe the difference in timing between the SEQUENTIAL test (`-s`) and the same function in the TEST_AND_COMPARE case (`-t`) is due to how you are invoking the `sequentialVersion()` function in each case, combined with the aggressive compiler optimization you have specified (`-O3`).

Here is a worked test case, that demonstrates approximately the same difference in behavior:

$ cat t1017.cu
#include <time.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define N 1000000


#define SEQUENTIAL          "-s"
#define PARALLEL            "-p"
#define CUDA                "-c"
#define VERIFY              "-v"
#define TEST_AND_COMPARE    "-t"

#define GET_TIME(x); if (clock_gettime(CLOCK_MONOTONIC, &(x)) < 0)  {   perror("clock_gettime( ):");exit(EXIT_FAILURE);}

typedef float precision;

precision sequentialVersion() {precision retval = 0.0f; for (int i=0; i<N; i++) retval += (precision)i; return retval; }
precision parallelVersion()   {sleep(1); return 0.0f;};
float elapsed_time_msec(struct timespec *begin, struct timespec *end,
        unsigned long *sec, unsigned long *nsec) {
    if (end->tv_nsec < begin->tv_nsec) {
        *nsec = 1000000000 - (begin->tv_nsec - end->tv_nsec);
        *sec = end->tv_sec - begin->tv_sec - 1;
    } else {
        *nsec = end->tv_nsec - begin->tv_nsec;
        *sec = end->tv_sec - begin->tv_sec;
    }
    return (float) (*sec) * 1000 + ((float) (*nsec)) / 1000000;
}


int main(int argc, char **argv) {

    struct timespec t1, t2, t3, t4;
    unsigned long sec, nsec;
    float comp_time;
    int noOfThreads;

    //invoking the sequential version
    if (!strcmp(argv[1], SEQUENTIAL)) {
        GET_TIME(t1);
        sequentialVersion();
        GET_TIME(t2);
        comp_time = elapsed_time_msec(&t1, &t2, &sec, &nsec);
        printf("N=%d: Time(ms)=%.5f \n", N, comp_time);
    }

    //invoking the parallel version
    else if (!strcmp(argv[1], PARALLEL)) {
        noOfThreads = atoi(argv[2]);
        GET_TIME(t1);
        parallelVersion();
        GET_TIME(t2);
        comp_time = elapsed_time_msec(&t1, &t2, &sec, &nsec);
        printf("N=%d: Threads=%d: Time(ms)=%.5f \n", N, noOfThreads,
                comp_time);
    }

    //the cuda invoke goes here...

    //comparing the answers received by each method of calculation
    else if (!strcmp(argv[1], TEST_AND_COMPARE)) {

        precision answer1, answer2, answer3;

        GET_TIME(t1);
        answer1 = sequentialVersion();
        GET_TIME(t2);
        comp_time = elapsed_time_msec(&t1, &t2, &sec, &nsec);
        printf("%-10s\tN=%d: Ans=%f: Time(ms)=%.5f \n", "Serial", N, answer1, comp_time);

        noOfThreads = atoi(argv[2]);
        GET_TIME(t3);
        answer2 = parallelVersion();
        GET_TIME(t4);
        comp_time = elapsed_time_msec(&t3, &t4, &sec, &nsec);
        printf("%-10s\tN=%d: Ans=%f: Time(ms)=%.5f Threads=%d \n", "Parallel",  N, answer2, comp_time, noOfThreads);
    }
}


$ nvcc -o t1017 t1017.cu
$ ./t1017 -s
N=1000000: Time(ms)=3.61435
$ nvcc -O3 -o t1017 t1017.cu
$ ./t1017 -s
N=1000000: Time(ms)=0.00068
$ ./t1017 -t 6
Serial          N=1000000: Ans=499940360192.000000: Time(ms)=1.40843
Parallel        N=1000000: Ans=0.000000: Time(ms)=1000.16150 Threads=6
$

Note that when the code is compiled with no optimization specified, the timing in the `-s` case is a few milliseconds. When we compile with `-O3`, the timing is approximately zero in the `-s` test, but is still a few milliseconds in the `-t` test.

To fix this, simply assign (do not ignore) the return value of the `sequentialVersion()` function to a variable wherever you use it. Instead of this:

    sequentialVersion();

do this:

    precision temp = sequentialVersion();

You may also want to print out or otherwise "use" the temp value later. By doing so, the compiler is unable to optimize away the sequential code.

As pointed out already, this issue has nothing to do with CUDA. You could take the code I have shown, place it in a .cpp file instead of a .cu file, and compile with g++ instead of nvcc, and witness the same characteristics. Since the code has no device code in it, nvcc will simply hand it off to the host compiler anyway.

Macro disclaimer:

While there may be stylistic issues and/or hazards with the use of the particular timing macro you have shown, I believe that:

  1. it does not impact the issue you are actually asking about in this question
  2. it does actually work the way I think it does, when used in this particular way in this particular program.

Since I don't believe that particular macro is impacting this particular issue, I've chosen to leave it as-is to demonstrate that the issue can be fixed (in this test case) with no modification to the timing macro. When I choose to do host-based timing, I typically use an ordinary function such as I have demonstrated here. If you have questions about issues associated with that particular macro, you might want to ask it as a separate question. I don't think that would need to be tagged with cuda.

Robert Crovella
  • I removed the unnecessary tag and tried out the solution you explained. When the returned value from `sequentialVersion()` is assigned to a variable and used later, the program takes more time than the parallel version. And I didn't post the entire code since it would look bulky here; I thought the provided code would suffice in finding the issue. – Rajith Gun Hewage Jan 02 '16 at 03:51