
Today I had an issue calculating prime numbers with threads in Python. It was nearly as slow as without a thread (see my question).

Now I have written the same code in C using pthreads, expecting that the Python problem would not exist there.

#include <stdio.h>
#include <time.h>
#include <pthread.h>

int isPrime(int number) {
    int i;
    for (i=2; i<number; i++) {
        if (number % i == 0 && i != number) return 0;
    }
    return 1;
}

void calcPrimeNumbersFromNtoM(int n, int m){
    for (int i = n; i <= m; i++) {
        if (isPrime(i)) {
            //printf("%i\n",i);
        }
    }

}

void *calcFirstHalf(){
    calcPrimeNumbersFromNtoM(1,5000);
    return NULL;
}

void *calcSecondHalf(){
    calcPrimeNumbersFromNtoM(5001,10000);
    return NULL;
}

void calcThreadedPrimenumbers(){
    pthread_t t1, t2;
    pthread_create(&t1, NULL, calcFirstHalf, NULL);
    pthread_create(&t2, NULL, calcSecondHalf, NULL);

    //wait for the threads to finish
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}

int main(int argc, const char * argv[])
{

    clock_t startNT, endNT,startT, endT;
    double cpu_time_usedNT,cpu_time_usedT;
    startNT = clock();
    calcPrimeNumbersFromNtoM(1, 10000);
    endNT = clock();
    cpu_time_usedNT = ((double) (endNT - startNT)) / CLOCKS_PER_SEC;

    startT = clock();
    calcThreadedPrimenumbers();
    endT = clock();
    cpu_time_usedT = ((double) (endT - startT)) / CLOCKS_PER_SEC;


    printf("--------Results-----------\n");
    printf("Non threaded took: %f secs\n",cpu_time_usedNT);
    printf("Threaded took: %f secs\n",cpu_time_usedT);


    return 0;
}

The result is that the threaded version is again about as slow as the non-threaded one:

--------Results-----------
Non threaded took: 0.020624 secs
Threaded took: 0.027257 secs

This confuses me a lot. Is there something wrong with my code? Is it true that threads are not necessarily faster than using no threads? If so, what is the explanation for this?

Is this caused by the OS having to schedule the same task, only divided into two parts, which leads to the same total amount of time?

Maybe it is important: I use a 2.6 GHz Core i5 MacBook with OS X 10.9.

arnoapp
  • Your problem might not be large enough to show a significant difference. Try calculating a much larger number of primes. – Mark Ransom Apr 28 '14 at 22:13
  • Calculation from 1 to 1000000 results in: `--------Results----------- Non threaded took: 137.332219 secs Threaded took: 140.151069 secs` So this can't be the issue – arnoapp Apr 28 '14 at 22:19
  • Threads do not make up for bad algorithms. A decent Public Domain version is here: http://nothings.org/stb/stb_h.html – technosaurus Apr 28 '14 at 22:35
  • ... no need to check even numbers past 2, no need to increment i past number/2 ... just that would make it 4x faster, but you could also exclude multiples of 3 past 3 etc... for moderate improvements. To balance the threads you should split them unevenly so that 1 gets the lower ~2/3 and the other gets the remaining ~1/3. – technosaurus Apr 28 '14 at 22:53
  • @technosaurus No need to increment past *sqrt(number)*, actually. – Niklas B. Apr 28 '14 at 23:16
  • @NiklasB. for the larger number thread that would be useful, but for smaller numbers sqrt takes a lot more cycles than it saves (on typical platforms). – technosaurus Apr 28 '14 at 23:40
  • Parallelizing such a bad algorithm makes no sense at all. As amit's answer states correctly, because of the bad asymptotics, the parallel version is completely dominated by the second half. Only if you have a better base algorithm that hasn't the complexity dominate for larger numbers you will ever see an effect of parallelization. – Jens Gustedt Apr 28 '14 at 23:41
  • @technosaurus You don't actually need to compute the square root. One multiplication per loop will do the trick. Keep track of the `root` in a variable. On each pass through the loop `if (root * root < number)` then increment the root. – user3386109 Apr 29 '14 at 00:12
  • @technosaurus You can also just do `for (i = 2; i*i <= number; ++i)` By the way I refuse to believe that sqrt would actually be slower than that. I cannot imagine it takes more than a few cycles on any given modern CPU, so unless number is really, really small (in the order of dozens), the costly integer division operations inside the loop will dominate the cost – Niklas B. Apr 29 '14 at 00:23
  • @niklas certainly not on any modern x86 cpu (I think it has a latency of about 14 cycles on Sandy bridge), but maybe on some strange old ARM softcore. – Voo Apr 29 '14 at 07:05
  • @Voo but even there the modulus operation might be very costly as well and is used much more often :) – Niklas B. Apr 29 '14 at 14:18
  • @Niklas Well I can certainly imagine some software emulation layer that has "optimized" modulo but does the sqrt using some non-optimized newton's method ;-) As I said, certainly not true on any modern "normal" CPU. – Voo Apr 29 '14 at 14:23

5 Answers


Your primes calculator is O(n^2): testing a single number k costs up to k trial divisions, so testing all numbers up to n costs on the order of n^2/2 operations. The first thread therefore does roughly 5000^2/2 = 12,500,000 operations, while the second does (10000^2 - 5000^2)/2 = 37,500,000, three times as much.

This makes the second thread the bottleneck of the algorithm: the first thread does comparatively little work, finishes early, and then sits idle while the program waits for the second thread to complete.

amit
  • That's a good point, but it should be the same problem for the non-threaded alternative. So if there could be an increase of time then it should be in the lower bounds, which are processed concurrently – arnoapp Apr 28 '14 at 22:35
  • To be exact, the run time is `O(n^2-nlogn)` (which is still very close to O(n^2). This is because the average run time per iteration is `1/2+2/3+...+(n-1)/n = 1-1/2+1-1/3+1-1/4+...+1-1/n = n- (1/2+1/3+..+1/n) = n-Hn` In here, `Hn` is the harmonic number, and is in `O(logn)`. – amit Apr 28 '14 at 23:20
  • @amit: as `nlogn` is smaller than `n^2` for any decent sized `n`, it gets dropped from big-O notation, so `O(n^2)` is still correct. – Mooing Duck Apr 28 '14 at 23:30
  • @MooingDuck Yeap, of course you are correct. Still, I like this analysis of the time complexity so I'm going to leave it here, with your extra comment to clarify things up. – amit Apr 28 '14 at 23:31
  • Shouldn't you still see about a 33% speedup? I'm not convinced by this answer – Niklas B. Apr 29 '14 at 14:21

clock() returns CPU time. If you're using 2 CPUs concurrently for 1 second, clock() will increase by 2. You will want to measure wall time (actual elapsed real world time) instead. Also, as other answerers have said, your thread loads are imbalanced, so one thread will run for much longer than the other, although total wall time should still only be a little over 75% of the single-threaded case. (for a sufficiently long workload)

pmdj
  • How can I measure wall time? – arnoapp Apr 28 '14 at 22:34
  • @AzzUrr1 e.g. `time()` - see, for example [C: using clock() to measure time in multi-threaded programs](http://stackoverflow.com/questions/2962785/c-using-clock-to-measure-time-in-multi-threaded-programs) – pmdj Apr 28 '14 at 22:41

I think you'll find that your isPrime function is O(n), so the second half with large n will dominate the overall timings. You should time both halves individually for the unthreaded test.

Mark Ransom
  • @amit the way the loop is structured, a prime number will run all the way to `n`. – Mark Ransom Apr 28 '14 at 23:04
  • I removed my comment, I was talking about average case, I thought that the average run time is O(1/2+1/3+...+1/n), but it was a mistake. It should be `O(1/2+2/3+...+(n-1)/n)`. This is in `O(n-log(n))`, so both of us were pretty much correct in the first place. – amit Apr 28 '14 at 23:18

Specifically addressing your (general) questions

Is it true that threads are not necessary faster than using no thread? 
If yes what is the explanation for this?

The efficiency of using multiple threads to accomplish a task is limited, primarily, by the number of CPU cores (including hyper-threading where available). For example, if your system has two cores, then two threads can run at the same time. In your case (i5), you may have a 2-core, or 4-core processor. With hyper-threading, your system can run 4 or 8 threads at the same time.

Since your application has only two worker threads (three, including the parent 'main()' thread), there should be a notable improvement. However, keep in mind that your threads are not the only ones active on your system. There are likely many threads of execution on your machine already, all competing for CPU resources.

As a CPU resource becomes available, the thread scheduler pulls another thread from the queue of threads waiting for a CPU. It is unlikely that one of your threads will always be at the top of the run-queue, so they will spend time waiting for their turn.

Each time your code calls a 'blocking' function, the thread's context is stored in memory, and the thread is returned to the run-queue. Even innocent functions like 'printf()', which may block, will cause the thread to be returned to the run-queue.

Often, peer threads compete for resources other than CPU resources; such as shared memory, shared file access, etc. Generally these resources are protected by semaphores, locks, etc. This can also impact the efficiency of multiple threads vs a single thread.

These, and many other factors (including that mentioned by Mark Ransom) may have an effect on the timing results.

Mahonri Moriancumer
  • This is a generic answer about threads and has nothing to do with the specific question. – amit Apr 28 '14 at 22:26

You can load-balance your threads by partitioning the work differently. Note that 2 is the only even prime, so give each thread half of the odd numbers with code like this

void *calcFirstHalf()
{
    int i;
    for ( i = 1; i < 1000000; i += 4 )  // 1, 5, 9, 13...
       if ( isPrime( i ) )
       {
       }
    return NULL;
}

void *calcSecondHalf()
{
    int i;
    for ( i = 3; i < 1000000; i += 4 )  // 3, 7, 11, 15...
       if ( isPrime( i ) )
       {
       }
    return NULL;
}

Side note: you can also improve the efficiency of the isPrime function by only checking factors up to the square root of the proposed prime, since every non-prime must have at least one factor that is less than or equal to the square root.


Doing performance measurements on a Mac

The high-precision timer on a Mac is accessed through the mach_absolute_time function, as demonstrated by the code below.

#include <stdio.h>
#include <unistd.h>
#include <mach/mach.h>
#include <mach/mach_time.h>

void testTimer( void )
{
    uint64_t start, end;
    mach_timebase_info_data_t info;

    mach_timebase_info( &info );
    printf( "numer=%u denom=%u\n", info.numer, info.denom );

    start = mach_absolute_time();
    sleep( 1 );
    end = mach_absolute_time();

    printf( "%llu\n", end - start );
}

Note that the length of a timer tick is not a fixed value, but must be calculated from the information returned by the mach_timebase_info function. The conversion to nanoseconds is

elapsed_ns = (end - start) * numer / denom

(on many Intel Macs numer and denom are both 1, so the timer ticks in nanoseconds). You can confirm the rate by sleeping for one second and checking that the converted result is approximately 10^9 ns.

user3386109
  • Checked it - similar result – arnoapp Apr 28 '14 at 22:48
  • Just checked the man page for `clock`. It says that `clock` measures the CPU time used by a process. That's problematic for your experiment since multi-threading doesn't reduce the overall CPU time. Multi-threading will reduce wall-clock time if multiple CPUs are available. – user3386109 Apr 28 '14 at 23:01
  • @AzzUrr1 I've added some information on how to do precision time measurements on a Mac. – user3386109 Apr 28 '14 at 23:20