3

I have the following C++ program:

#include <boost/thread.hpp>
#include <boost/timer.hpp>

void testSpeed(int start, int end)
{
    int temp = 0;
    for(int i = start; i < end; i++)
    {
        temp++;
    }
}  

int main() {

    using namespace boost;

    timer aTimer;

    // start two new threads that call the "testSpeed" function

    boost::thread my_thread1(&testSpeed,         0,  500000000);
    boost::thread my_thread2(&testSpeed, 500000000, 1000000000);

    // wait for both threads to finish
    my_thread1.join();
    my_thread2.join();

    double elapsedSec = aTimer.elapsed();
    double IOPS = 1000000000.0 / elapsedSec;   // one billion increments in total
}

So the idea is to test the CPU speed in terms of integer operations per second (IOPS). There are 1 billion iterations (operations), so on a 1 GHz CPU we should get around a billion integer operations per second, I believe.

My assumption is that more threads = more integer operations per second. But the more threads I try, the fewer operations per second I see (I have more cores than threads). What may be causing such behavior? Is it the thread overhead? Maybe I should try a much longer experiment to see if the threads actually help?

Thank you!

UPDATE: So I changed the loop to run 18 billion times and declared temp as volatile. I also added another testSpeed method with a different name, so now the single-threaded run executes both methods one after the other, while the two-threaded run gives each thread its own method; there shouldn't be any synchronization issues, etc. (roughly as sketched below). And... still no change in behavior! The single-threaded run is faster according to the timer. Ahhh! I found the sucker: apparently the timer is bluffing. The two threads take half the time to finish, but the timer tells me the single-threaded run was two seconds faster. I'm now trying to understand why... Thanks everyone!
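
A minimal sketch of what that updated test might look like (the `long long` types, the exact 50/50 split of the 18 billion iterations, and the use of two separate `boost::timer` objects are my assumptions, not the original code); depending on the Boost version, linking typically needs something like `-lboost_thread -lboost_system`:

#include <boost/thread.hpp>
#include <boost/timer.hpp>
#include <iostream>

// volatile forces the compiler to actually perform every increment;
// long long is needed because 18 billion does not fit in an int
void testSpeedA(long long start, long long end)
{
    volatile long long temp = 0;
    for (long long i = start; i < end; i++)
        temp++;
}

// identical body under a different name, so each thread gets its own function
void testSpeedB(long long start, long long end)
{
    volatile long long temp = 0;
    for (long long i = start; i < end; i++)
        temp++;
}

int main()
{
    const long long N = 18000000000LL;   // 18 billion iterations in total

    boost::timer singleTimer;
    testSpeedA(0, N / 2);                // single-threaded: run both halves back to back
    testSpeedB(N / 2, N);
    std::cout << "single thread: " << singleTimer.elapsed() << " s\n";

    boost::timer doubleTimer;
    boost::thread t1(&testSpeedA, 0LL, N / 2);   // two threads: one half each
    boost::thread t2(&testSpeedB, N / 2, N);
    t1.join();
    t2.join();
    std::cout << "two threads:   " << doubleTimer.elapsed() << " s\n";

    // note: boost::timer is built on std::clock(), which on some platforms
    // measures CPU time summed over all threads rather than wall-clock time,
    // so the two-threaded run can report a larger number even when it
    // finishes sooner on the wall clock
}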

user247866
  • What is `numItr`? How do you test with more threads exactly? (There are a lot of subtle issues that could be going on here, we need to see the real code.) – David Schwartz Feb 21 '12 at 23:06
  • How long is it running now? Have you compared single and double threaded runtimes? – Kerrek SB Feb 21 '12 at 23:07
  • Please **fix your code** so that it compiles. – Cheers and hth. - Alf Feb 21 '12 at 23:09
  • The compiler might notice that `testSpeed` has no side effects, and optimise it to return immediately. Try `volatile int temp` to force it to update it the specified number of times (and check the assembly to make sure it's doing what you think it should). – Mike Seymour Feb 21 '12 at 23:09
  • Sorry guys, I removed numItr. The current code shows two threads, but I can also remove the second thread and let the first thread run over the entire space (0 to a billion). Thanks! – user247866 Feb 21 '12 at 23:11
  • How long does your test take to execute? – stonemetal Feb 21 '12 at 23:12
  • I have 4 cores. It's fast, takes around 1.5 seconds. – user247866 Feb 21 '12 at 23:14
  • "on 1Ghz CPU we should get around billion integer operations per second" not quite. You get 1 billion clock cycles, where each machine instruction takes `n` cycles (the x86 reference lists those, iirc), and that total time is split between `n` processes and the OS, and even your relatively optimized call has to do a `cmp, jnz, add` at least, so you probably won't get 1 billion additions. You will still get a helluva lot, and this should only take a few seconds. – ssube Feb 21 '12 at 23:15
  • Will try the volatile that @Mike Seymour suggested. – user247866 Feb 21 '12 at 23:22
  • I agree with peachykeen, but it's actually even more subtle than that. Most systems these days are hyperthreaded, meaning that in addition to multiple cores, there are multiple pipelines within a core. This means that your code may run faster than expected. Also, context swapping takes time within a core. There's no guarantee that the OS will choose to put your previous calculation on the same core it was on previously, so you may need a trip through memory to get the context over, which would slow things down. And that even ignores normal OS interrupts taking time. – Joel Feb 21 '12 at 23:23
  • Declaring temp as volatile didn't change the threading behavior. – user247866 Feb 21 '12 at 23:32

2 Answers

3

I am almost certain that the compiler optimizes away your loops. Since you do not subtract the overhead of creating/synchronizing the threads, you actually measure only that. So the more threads you have, the more overhead you create and the more time it takes.

Overall, you can refer to the documentation of your CPU to find out its frequency and how many cycles any given instruction takes. Testing it yourself with an approach like this is nearly impossible and, well, useless, because of overhead such as context switches, execution migrating from one CPU/core to another, scheduler swap-outs, and branch mispredictions. In real life you will also encounter cache misses and a lot of memory-bus latency, since no program fits into ~15 registers. So you are better off testing a real program with a good profiler. For example, the latest CPUs can report stall information, cache misses, branch mispredictions and a lot more. You can also use a good profiler to decide when and how to parallelize your program.
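
For example, giving the loop an observable side effect, either by making the counter `volatile` or by returning and printing the result, is enough to keep an optimizing compiler from deleting it. A minimal sketch (the names and the one-billion iteration count are just for illustration):

#include <cstdio>

// Without a side effect, an optimizer at -O2 may replace this whole loop
// with nothing, because temp is a local variable that is never read afterwards.
long long testSpeed(long long start, long long end)
{
    volatile long long temp = 0;   // volatile forces every increment to be performed
    for (long long i = start; i < end; i++)
        temp++;
    return temp;
}

int main()
{
    // printing the result makes the computation observable,
    // so the compiler cannot discard it either
    std::printf("%lld\n", testSpeed(0, 1000000000LL));
}

Checking the generated assembly, as one of the comments above suggests, is still the only way to be sure what is actually being measured.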

  • Interesting. What do you mean by "compiler optimizes away your loops"? – user247866 Feb 21 '12 at 23:16
  • @user247866 It sees that the loop counter is a constant and performs some nasty analysis. With a little bad luck you end up with it simply setting `temp` to `end - start`. – pmr Feb 21 '12 at 23:20
  • An optimizing compiler can detect that your function testSpeed() has no side effects. Because you only modify a local variable and never reference it, the compiler can see that testSpeed() has no side effects and can effectively be substituted by a no-op with no change in your program's semantics. This is a fundamental danger of testing micro-benchmarks like this - you may not end up measuring what you think you're measuring. – aalpern Feb 21 '12 at 23:22
  • I guess you should also mention `volatile`, as one of OP commenters did. – Griwes Feb 21 '12 at 23:43
  • @user247866 See my answer [here](http://stackoverflow.com/questions/8841865/how-does-gcc-optimize-c-code). It goes into a bit more detail on this type of optimization. – Mysticial Feb 22 '12 at 00:17
2

As the number of threads increases beyond a certain point, the number of cache misses increases (the cache is being shared among the threads), but at the same time memory-access latency is being masked by the large number of threads (while one thread is waiting for data to be fetched from memory, other threads are running). Hence there is a trade-off. Here is an interesting paper on this subject.

According to this paper, on a multi-core machine, when the number of threads is very low (of the order of the number of cores), performance will increase as the number of threads increases, because the cores are then being fully utilized.

After that, a further increase in the number of threads makes the effect of cache misses dominate, leading to a degradation in performance.

If the number of threads becomes very large, such that the amount of cache storage per thread becomes almost zero, all memory accesses are made from main memory. But at the same time the increased number of threads also very effectively masks the increased memory-access latency. This time the second effect dominates, leading to an increase in performance.

Thus the valley in the middle is the region with the worst performance.
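
One way to see which of these regimes a particular machine falls into is to sweep the thread count over a fixed, memory-bound amount of work and time each run with a wall clock. The sketch below is only an illustration of that idea, not code from the paper; the buffer size, thread counts and summing workload are arbitrary choices, and it uses std::thread/std::chrono rather than Boost:

#include <chrono>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Each thread sums one slice of a shared buffer, so all threads compete
// for the same caches and the same memory bus.
void sumSlice(const int* data, long long count, long long* out)
{
    long long sum = 0;
    for (long long i = 0; i < count; i++)
        sum += data[i];
    *out = sum;
}

int main()
{
    const long long N = 64LL * 1024 * 1024;   // 256 MiB of ints
    std::vector<int> data(N, 1);

    for (int threads = 1; threads <= 16; threads *= 2)
    {
        std::vector<long long> partial(threads, 0);
        const long long slice = N / threads;

        auto begin = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (int t = 0; t < threads; t++)
            pool.emplace_back(sumSlice, data.data() + t * slice, slice, &partial[t]);
        for (auto& th : pool)
            th.join();

        double sec = std::chrono::duration<double>(
                         std::chrono::steady_clock::now() - begin).count();
        long long total = std::accumulate(partial.begin(), partial.end(), 0LL);

        // wall-clock time per configuration; printing the checksum keeps the work observable
        std::printf("%2d thread(s): %.3f s (checksum %lld)\n", threads, sec, total);
    }
}

On a machine with more cores than threads one would typically expect the time to drop at first and then flatten out once memory bandwidth, rather than the number of cores, becomes the bottleneck.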

bisarch