
I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any option I can pass to g++ so that the resulting .exe uses multiple cores, i.e. so the first for loop runs on the first core and the second for loop runs on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers at around 25%.

EDIT: Here is my code, in case it helps. I'm basically just writing a program to test the speed capabilities of my computer.

#include <iostream>
#include <math.h>
using namespace std;

int main()
{
    float *bob = new float[50102133];
    float *jim = new float[50102133];  // allocated but never used
    float *joe = new float[50102133];  // allocated but never used

    int i, j, k, l;
    //cout << "Starting test...";
    for (i = 0; i < 50102133; i++)
        bob[i] = sin((double)i);
    for (j = 0; j < 50102133; j++)
        bob[j] = sin((double)j * j);   // cast before multiplying to avoid int overflow
    for (k = 0; k < 50102133; k++)
        bob[k] = sin(sqrt((double)k));
    for (l = 0; l < 50102133; l++)
        bob[l] = cos((double)l * l);   // cast before multiplying to avoid int overflow
    cout << "finished test.";
    cout << " the 1001200th element is " << bob[1001200];

    delete[] bob;
    delete[] jim;
    delete[] joe;
    return 0;
}
user3368803

5 Answers


The most obvious choice would be to use OpenMP. Assuming the iterations of your loop are independent, so it's safe to execute them in parallel, you might be able to just add:

#pragma omp parallel for

...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
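
For example, with g++ that might look like this (speedtest.cpp is just a placeholder name for your source file):

g++ -O3 -fopenmp speedtest.cpp -o speedtest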

Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
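
To give one example of the kind of tuning available: OpenMP's schedule clause controls how loop iterations are divided among threads, which can matter when iterations have uneven cost. This fragment is only an illustration, not a recommendation for this particular code (the chunk size is arbitrary, and bob/size stand in for your own array and bound):

// Chunks of 1024 iterations are handed to threads on demand,
// instead of splitting the range evenly up front.
#pragma omp parallel for schedule(dynamic, 1024)
for (int i = 0; i < size; i++)
    bob[i] = sin(sqrt((double)i));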

The other advice you're getting ("use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific kinds of parallel code. For a situation such as the one you describe (executing the iterations of a loop in parallel), OpenMP is generally preferred: it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
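
For comparison, here's a minimal sketch of what the manual route might look like with C++11 std::thread; fill_range is a made-up helper name, and the fixed two-way split is only for illustration (compile with -std=c++11 -pthread):

#include <math.h>
#include <thread>
#include <vector>

// Each thread fills its own non-overlapping slice [begin, end), so no locking is needed.
static void fill_range(float *out, int begin, int end)
{
    for (int i = begin; i < end; i++)
        out[i] = sin((double)i);
}

int main()
{
    const int size = 50102133;
    std::vector<float> bob(size);

    std::thread t1(fill_range, bob.data(), 0, size / 2);
    std::thread t2(fill_range, bob.data(), size / 2, size);
    t1.join();
    t2.join();
    return 0;
}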

Edit:

The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.

To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:

double total = 0;

for (int i = 0; i < size; i++)
    total += sin(i) + sin((double)i*i) + sin(sqrt((double)i)) + cos((double)i*i);  // casts avoid int overflow in i*i

By adding a pragma:

#pragma omp parallel for reduction(+:total)

...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:

Real    16.0399
User    15.9589
Sys     0.0156001

...but with the #pragma and OpenMP enabled when I compile, I get a time like this:

Real    8.96051
User    17.5033
Sys     0.0468003

So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.

Without OpenMP:

Real    15.339
User    15.3281
Sys     0.015625

...and with OpenMP:

Real    3.09105
User    23.7813
Sys     0.171875

For completeness, here's the final code I used:

#include <math.h>
#include <iostream>

static const int size = 1024 * 1024 * 128;
int main(){
    double total = 0;

// reduction(+:total) gives each thread a private partial sum, combined at the end
#pragma omp parallel for reduction(+:total)
    for (int i = 0; i < size; i++)
        total += sin(i) + sin((double)i*i) + sin(sqrt((double)i)) + cos((double)i*i);
    std::cout << total << "\n";
}
Jerry Coffin
  • Thank you! I tried your optimized code and I was able to run ~5 billion calculations in 30 seconds with optimizations (vs. almost 2 minutes without the -fopenmp), compared to ~5 million iterations per second with the original, memory-intensive program. – user3368803 Mar 02 '14 at 02:21

The compiler has no way of telling whether the code inside your loop can safely be executed on multiple cores. If you want to use all your cores, use threads.

Ryp

C++11 added support for threading (std::thread), but C++ compilers won't/can't parallelize your code on their own.

AliciaBytes

Use threads or processes; you may want to look at OpenMP.

Jekyll

As others have pointed out, you can use threads manually to achieve this. You might look at libraries such as libdispatch (a.k.a. Grand Central Dispatch) or Intel's TBB to help you do this with the least pain.
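
For instance, here's a rough sketch of what the TBB route might look like, applied to one of the loops from the question (this assumes the iterations are independent; link with -ltbb):

#include <math.h>
#include <vector>
#include <tbb/parallel_for.h>

int main()
{
    const int size = 50102133;
    std::vector<float> bob(size);

    // TBB splits [0, size) into chunks and runs the lambda across its worker pool.
    tbb::parallel_for(0, size, [&](int i) {
        bob[i] = sin(sqrt((double)i));
    });
    return 0;
}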

The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation-parallel, still using a single thread.

The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on the previous iteration, and the operations in the loop are linear.

On some ARM Cortex-A series systems, at least, you may need to accept slightly reduced floating-point accuracy to get the full benefit.
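
With g++, for example, that trade-off is typically opted into with -ffast-math, which relaxes strict IEEE conformance in exchange for faster, more vectorizable math (speedtest.cpp is again just a placeholder file name):

g++ -O3 -ftree-vectorize -ffast-math speedtest.cpp -o speedtest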

marko