82

This is quite an interesting question so let me set the scene. I work at The National Museum of Computing, and we have just managed to get a Cray Y-MP EL super computer from 1992 running, and we really want to see how fast it can go!

We decided the best way to do this was to write a simple C program that would calculate prime numbers and show how long it took to do so, then run the program on a fast modern desktop PC and compare the results.

We quickly came up with this code to count prime numbers:

#include <stdio.h>
#include <time.h>

void main() {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;

    while (num <= 1000) { 
        i = 2; 
        while (i <= num) { 
            if(num % i == 0)
                break;
            i++; 
        }
        if (i == num)
            primes++;

        system("clear");
        printf("%d prime numbers calculated\n",primes);
        num++;
    }

    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    printf("This machine calculated all %d prime numbers under 1000 in %g seconds\n", primes, runTime);
}

Which on our dual core laptop running Ubuntu (The Cray runs UNICOS), worked perfectly, getting 100% CPU usage and taking about 10 minutes or so. When I got home I decided to try it on my hex-core modern gaming PC, and this is where we get our first issues.

I first adapted the code to run on Windows since that is what the gaming PC was using, but was saddened to find that the process was only getting about 15% of the CPU's power. I figured that must be Windows being Windows, so I booted into a Live CD of Ubuntu thinking that Ubuntu would allow the process to run with its full potential as it had done earlier on my laptop.

However I only got 5% usage! So my question is, how can I adapt the program to run on my gaming machine in either Windows 7 or live Linux at 100% CPU utilisation? Another thing that would be great but not necessary is if the end product can be one .exe that could be easily distributed and ran on Windows machines.

Thanks a lot!

P.S. Of course this program didn't really work with the Crays 8 specialist processors, and that is a whole other issue... If you know anything about optimising code to work on 90's Cray super computers give us a shout too!

Mysticial
  • 464,885
  • 45
  • 335
  • 332
bag-man
  • 1,423
  • 2
  • 19
  • 23
  • 1
    Have you tried multi-threading? – Mysticial Feb 11 '12 at 22:13
  • Was your gaming PC multi-core? – Chriseyre2000 Feb 11 '12 at 22:13
  • 8
    I can't believe there's not a *unicos* tag. ;) – Edward Thomson Feb 11 '12 at 22:15
  • 32
    It is a strange thing that this single thread program took 100% of CPU usage on DUAL CORE processor ))) – mikithskegg Feb 11 '12 at 22:15
  • My laptop is Dual Core, my PC is Hex Core. @mikithskegg well in top that process was at 100%, I take it that isn't right then? – bag-man Feb 11 '12 at 22:16
  • Are you sure that your laptop is really dual core and not using something like hyperthread technology? – mikithskegg Feb 11 '12 at 22:18
  • In windows, open the task manager, go to processes, right click the process and set its priority to real-time. Does that increase the CPU usage? – Diego Feb 11 '12 at 22:19
  • how did you even measure the CPU usage that thing just prints out `This machine calculated all 168 prime numbers under 1000 in 0.002 seconds` on my aging laptop – PeterT Feb 11 '12 at 22:25
  • 1
    `void main()` should be `int main(void)`. Cray's C compiler doesn't support C99; I'm surprised that it accepted mixed declarations and statements. You need `#include ` to make the declaration of `system()` visible -- or you could just delete the `system("clear")`, which seems unnecessary. (Printing `'\r'` should go to the beginning of the current line.) Crays are optimized for floating-point performance; something that does integer calculations doesn't really exercise its strengths. – Keith Thompson Feb 12 '12 at 02:00
  • If you want a single-threaded benchmark, consider something like (Whetstone)[http://en.wikipedia.org/wiki/Whetstone_(benchmark)]. – Keith Thompson Feb 12 '12 at 02:00
  • 24
    Am I the only one who doesn't find this question interesting at all? Come one, running a single threaded program on a n-core machine and asking why it uses 1/n of the the cpu is just... never mind, I just downvote :-) – Gunther Piez Feb 12 '12 at 02:04
  • 17
    @drhirsch Well, the question shows research effort. I +1'ed for that - even if the OP is missing something fundamental about multi-core computing. – Mysticial Feb 12 '12 at 02:14
  • out of scope: you should check only i <= sqrt(num) – Nir Alfasi Feb 12 '12 at 07:04
  • 10
    @drhirsch There are a lot of uninteresting questions on the site. However, interesting or not is subjective. He may be missing the fundamentals and that isn't subjective. Like Mystical said, it does show research effort and isn't as easy to answer as it would appear. – Carl Feb 12 '12 at 23:48
  • 1
    Benchmarking a many-core machine using a single thread is not the only thing that makes the approach somewhat nonsensical, but also using a problem that is embarrassingly bad suited for the architecture. Cray computers drew much of their compute power from several (3 or 4 if I recall correctly) deeply pipelined vector processors per CPU core. The benchmarking code is strictly scalar. On a problem that is well-suited (and well parallelized), the Cray would likely perform anywhere from 50 to 100 times better. – Damon Jun 20 '13 at 07:31
  • 1
    Finding all primes up to 1,000 shouldn't even take measurable time. Whatever displays "x% CPU" probably doesn't notice it. And then there is a difference between how Windows and MacOS X display CPU usage: Windows displays 100% if _all_ CPUs are fully used. MacOS X shows 100% if _one_ CPU is fully used, so a single thread can easily display 100%, and on a quad core with multithreading the display will go close to 800%. Just a different display. – gnasher729 Mar 15 '14 at 08:14
  • @Damon Good point. As I recall, the Cray relied more on pipelining while the competing Cyber machines relied more on parallelization. Having a test each time through the loop will likely cause pipeline stalls and make the Cray run suboptimally unless you optimize specifically for the Cray's branch prediction. – Warren Dew Jun 05 '16 at 01:55

10 Answers10

86

If you want 100% CPU, you need to use more than 1 core. To do that, you need multiple threads.

Here's a parallel version using OpenMP:

I had to increase the limit to 1000000 to make it take more than 1 second on my machine.

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
    double start, end;
    double runTime;
    start = omp_get_wtime();
    int num = 1,primes = 0;

    int limit = 1000000;

#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
    for (num = 1; num <= limit; num++) { 
        int i = 2; 
        while(i <= num) { 
            if(num % i == 0)
                break;
            i++; 
        }
        if(i == num)
            primes++;
//      printf("%d prime numbers calculated\n",primes);
    }

    end = omp_get_wtime();
    runTime = end - start;
    printf("This machine calculated all %d prime numbers under %d in %g seconds\n",primes,limit,runTime);

    return 0;
}

Output:

This machine calculated all 78498 prime numbers under 1000000 in 29.753 seconds

Here's your 100% CPU:

enter image description here

alex
  • 479,566
  • 201
  • 878
  • 984
Mysticial
  • 464,885
  • 45
  • 335
  • 332
  • Wow that didn't take you long! I forgot to say the end value we will use is 1,000,000 1000 was for testing (Which as you saw was just silly!) I will test this out now, thanks a lot! – bag-man Feb 11 '12 at 22:30
  • Depending on your compiler, there's a compiler flag you have to enable to get OpenMP support. – Mysticial Feb 11 '12 at 22:32
  • And also one can use very useful function `omp_get_wtime()` to measure time. – mikithskegg Feb 11 '12 at 22:32
  • What did you use to compile this? I am using MingGW's gcc – bag-man Feb 11 '12 at 22:43
  • I think this will be very hard to port to the Cray. Better stick to `pthread`s if you have to use threads, or better yet - `fork`. – cha0site Feb 11 '12 at 22:44
  • @OwenGarland I use MSVC with /openmp. With GCC, I think the flag is `-fopenmp`. – Mysticial Feb 11 '12 at 22:44
  • @cha0site OpenMP is as simple as it's going to get. For these distributed machines, you'll need MPI of some sort. But this program isn't memory bound at all, so it'll probably run well on any (even emulated) shared-memory architecture. – Mysticial Feb 11 '12 at 22:45
  • If you are looking for benchmark data then this will generate `This machine calculated all 78498 prime numbers under 1000000 in 173.203 seconds` on an Intel i3 M330 @2.13GHz which is a mobile processor on the low-end from a few years ago. – PeterT Feb 11 '12 at 22:47
  • The tricky part with porting this to anything other than OpenMP is that the loop iterations aren't even. They get more expensive as `num` increases and whether or not it's divisible by small numbers. So you'll need to implement some sort of dynamic scheduling with whatever approach you use. – Mysticial Feb 11 '12 at 22:49
  • @Mysticial: But he's likely not to have OpenMP on that machine, and _all_ he wants to do is peg the CPU - he doesn't actually care about the prime calculation side of things. You can do that easily enough with the primitives. – cha0site Feb 11 '12 at 22:50
  • 1
    @cha0site Yes, I mainly answered the question for the gaming machine. There are definitely more interesting ways to peg the CPU. One of the more notorious benchmarks I've done is my answer to [this question](http://stackoverflow.com/q/8389648/922184) - which overheated 2 of 4 machines I tested. – Mysticial Feb 11 '12 at 22:54
  • @Mysticial: Ah, very interesting indeed. However, that's probably still not what we want here - I think what the asker wants is POSIXy code which he can run both on his Linux machine and the Cray (which seems to run a POSIXy OS) and do some very rough comparisons. It's a bit late for me to write code, but I'll try to come up with something tomorrow ^__^ – cha0site Feb 11 '12 at 23:00
  • 1
    @Mystical Offtopic: What hardware are you running? My Hex-Core AMD @ 3.2Ghz did it in 92 seconds... – bag-man Feb 11 '12 at 23:15
  • 1
    @Owen: He has a Core i7 2600K... I'm jealous. – cha0site Feb 11 '12 at 23:23
  • Amazing, my 1 and abit year old CPU that I thought was amazing, is three times as slow as one that is pretty much the same price I paid for it! – bag-man Feb 11 '12 at 23:25
  • Core i7 920 @ 3.5GHz (overclocked) I also have a Core i7 2600K, but it's not the machine I usually do work on. I use it mainly for long running code. – Mysticial Feb 11 '12 at 23:38
  • @OwenGarland As for why it's 3x slower. This particular benchmark isn't very representative of real-life performance. All it does are integer divisions. I just checked [Agner Fog's tables](http://www.agner.org/optimize/instruction_tables.pdf), and it looks like AMD's integer division is a lot slower than Intel's. So that probably explains it. – Mysticial Feb 12 '12 at 00:09
  • @Mystical This has been very interesting, thanks a lot for your help! Under Windows the application needs to dll's to run by the way; I have no idea how you combine those into one exe :/ – bag-man Feb 12 '12 at 00:46
  • 20
    Augh! Too... much... pink! – Mateen Ulhaq Feb 12 '12 at 03:54
  • You guys been saying, "In Parallel", what does it mean? could someone enlighten me about this. – Mohammad Fadin Feb 13 '12 at 20:04
  • 2
    @MohammadFadin http://en.wikipedia.org/wiki/Parallel_computing Basically, you need to be able to process multiple tasks in parallel to be able to utilize a multi-core computer. – Mysticial Feb 13 '12 at 20:06
  • @Mysticial Can you solve [this](http://stackoverflow.com/q/34435282/2404470) confusion please – Zameer Ansari Dec 23 '15 at 12:26
24

You're running one process on a multi-core machine - so it only runs on one core.

The solution is easy enough, since you're just trying to peg the processor - if you have N cores, run your program N times (in parallel, of course).

Example

Here is some code that runs your program NUM_OF_CORES times in parallel. It's POSIXy code - it uses fork - so you should run that under Linux. If what I'm reading about the Cray is correct, it might be easier to port this code than the OpenMP code in the other answer.

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

#define NUM_OF_CORES 8
#define MAX_PRIME 100000

void do_primes()
{
    unsigned long i, num, primes = 0;
    for (num = 1; num <= MAX_PRIME; ++num) {
        for (i = 2; (i <= num) && (num % i != 0); ++i);
        if (i == num)
            ++primes;
    }
    printf("Calculated %d primes.\n", primes);
}

int main(int argc, char ** argv)
{
    time_t start, end;
    time_t run_time;
    unsigned long i;
    pid_t pids[NUM_OF_CORES];

    /* start of test */
    start = time(NULL);
    for (i = 0; i < NUM_OF_CORES; ++i) {
        if (!(pids[i] = fork())) {
            do_primes();
            exit(0);
        }
        if (pids[i] < 0) {
            perror("Fork");
            exit(1);
        }
    }
    for (i = 0; i < NUM_OF_CORES; ++i) {
        waitpid(pids[i], NULL, 0);
    }
    end = time(NULL);
    run_time = (end - start);
    printf("This machine calculated all prime numbers under %d %d times "
           "in %d seconds\n", MAX_PRIME, NUM_OF_CORES, run_time);
    return 0;
}

Output

$ ./primes 
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
This machine calculated all prime numbers under 100000 8 times in 8 seconds
cha0site
  • 10,517
  • 3
  • 33
  • 51
  • Ah kind of like when you need to run Prime95, you have multiple instances of it... Surely there is a way for one process to use multiple cores? Like hash cracking programs do. – bag-man Feb 11 '12 at 22:19
  • Well, one process could use threads to do multiprocessing, but I don't think that's what you meant since a thread is almost a separate process in this context. What we're really talking about here is "heads of execution", be they threads or processes. So, no, there isn't a way to make a single-threaded program run on multiple cores, you have to rewrite it. And sometimes it's _really_ hard. And sometimes it's actually impossible. – cha0site Feb 11 '12 at 22:23
  • Well I guess it won't be as hard as getting the program to work for the Cray as well. Considering I am pretty new to this (What gave me away :P) where would be a good place to start? – bag-man Feb 11 '12 at 22:26
  • @Owen: Well, `UNICOS` looks like it is somewhat similar to Unix (Wikipedia makes think so anyway), so it probably has `fork()`. You should go learn how to use that, I think. – cha0site Feb 11 '12 at 22:32
  • @Owen: Oh, and also, running your program 8 times from the shell shouldn't be much harder than: `for i in \`seq 8\`; do ./my_program; done` – cha0site Feb 11 '12 at 22:34
  • @OwenGarland: I've added an example. – cha0site Feb 12 '12 at 16:10
  • 2
    Oooh! +1'ed now that you have the example. :) – Mysticial Feb 12 '12 at 18:52
7

we really want to see how fast it can go!

Your algorithm to generate prime numbers is very inefficient. Compare it to primegen that generates the 50847534 primes up to 1000000000 in just 8 seconds on a Pentium II-350.

To consume all CPUs easily you could solve an embarrassingly parallel problem e.g., compute Mandelbrot set or use genetic programming to paint Mona Lisa in multiple threads (processes).

Another approach is to take an existing benchmark program for the Cray supercomputer and port it to a modern PC.

jfs
  • 399,953
  • 195
  • 994
  • 1,670
  • 2
    It doesn't matter that the algorithm is inefficient because the goal isn't to actually calculate the primes, it's to perform a generically difficult task and see how much better or worse it is at it than a modern desktop. An efficient algorithm would just make that comparison harder, and might even ruin the results if it's so good it purposely takes advantage of modern CPU features/quirks. – Numeron Oct 10 '19 at 03:43
  • @Numeron: The speed ratio between Integer division ALU vs. memory (or L3 cache) bandwidth for a sieve of Eratosthenes could be very different on old vs. new machines, so it's actually interesting to look at both benchmarks. (e.g. [Sieve of Eratosthenes in x86 assembly](https://codereview.stackexchange.com/posts/comments/548870) has some discussion about being limited by memory bandwidth when crossing out primes close to or greater than about 512 (so the stride is about a whole cache line, when using a bit-packed sieve, assuming efficient bit access via careful optimization) – Peter Cordes Sep 29 '22 at 08:04
5

The reason you're getting 15% on a hex core processor is because your code uses 1 core at 100%. 100/6 = 16.67%, which using a moving average with process scheduling (your process would be running under normal priority) could easily be reported as 15%.

Therefore, in order to use 100% cpu, you would need to use all the cores of your CPU - launch 6 parallel execution code paths for a hex core CPU and have this scale right up to however many processors your Cray machine has :)

Carl
  • 43,122
  • 10
  • 80
  • 104
  • The issue with doing this is that how can I get a clear figure of the speed of each of the machine? Also the Cray has "vector processors" apparently, so it requires a load more work than this to get it to run properly – bag-man Feb 11 '12 at 22:27
  • Don't know. Probably differences in scheduling processes. – Carl Feb 12 '12 at 23:36
3

Also be very aware how you're loading the CPU. A CPU can do a lot of different tasks, and while many of them will be reported as "loading the CPU 100%" they may each use 100% of different parts of the CPU. In other words, it's very hard to compare two different CPUs for performance, and especially two different CPU architectures. Executing task A may favor one CPU over another, while executing task B it can easily be the other way around (since the two CPUs may have different resources internally and may execute code very differently).

This is the reason software is just as important for making computers perform optimal as hardware is. This is indeed very true for "supercomputers" as well.

One measure for CPU performance could be instructions per second, but then again instructions aren't created equal on different CPU architectures. Another measure could be cache IO performance, but cache infrastructure is not equal either. Then a measure could be number of instructions per watt used, as power delivery and dissipation is often a limiting factor when designing a cluster computer.

So your first question should be: Which performance parameter is important to you? What do you want to measure? If you want to see which machine gets the most FPS out of Quake 4, the answer is easy; your gaming rig will, as the Cray can't run that program at all ;-)

Cheers, Steen

2

TLDR; The accepted answer is both inefficient and incompatible. Following algo works 100x faster.

The gcc compiler available on MAC can't run omp. I had to install llvm (brew install llvm ). But I didn't see CPU idle was going down while running OMP version.

Here is a screenshot while OMP version was running. enter image description here

Alternatively, I used the basic POSIX thread, that can be run using any c compiler and saw almost entire CPU used up when nos of thread = no of cores = 4 (MacBook Pro, 2.3 GHz Intel Core i5). Here is the programme -

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define NUM_THREADS     10
#define THREAD_LOAD 100000
using namespace std;

struct prime_range {
    int min;
    int max;
    int total;
};

void* findPrime(void *threadarg)
{
    int i, primes = 0;
    struct prime_range *this_range;
    this_range = (struct prime_range *) threadarg;

    int minLimit =  this_range -> min ;
    int maxLimit =  this_range -> max ;
    int flag = false;
    while (minLimit <= maxLimit) {
        i = 2;
        int lim = ceil(sqrt(minLimit));
        while (i <= lim) {
            if (minLimit % i == 0){
                flag = true;
                break;
            }
            i++;
        }
        if (!flag){
            primes++;
        }
        flag = false;
        minLimit++;
    }
    this_range ->total = primes;
    pthread_exit(NULL);
}

int main (int argc, char *argv[])
{
    struct timespec start, finish;
    double elapsed;

    clock_gettime(CLOCK_MONOTONIC, &start);

    pthread_t threads[NUM_THREADS];
    struct prime_range pr[NUM_THREADS];
    int rc;
    pthread_attr_t attr;
    void *status;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    for(int t=1; t<= NUM_THREADS; t++){
        pr[t].min = (t-1) * THREAD_LOAD + 1;
        pr[t].max = t*THREAD_LOAD;
        rc = pthread_create(&threads[t], NULL, findPrime,(void *)&pr[t]);
        if (rc){
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    int totalPrimesFound = 0;
    // free attribute and wait for the other threads
    pthread_attr_destroy(&attr);
    for(int t=1; t<= NUM_THREADS; t++){
        rc = pthread_join(threads[t], &status);
        if (rc) {
            printf("Error:unable to join, %d" ,rc);
            exit(-1);
        }
        totalPrimesFound += pr[t].total;
    }
    clock_gettime(CLOCK_MONOTONIC, &finish);
    elapsed = (finish.tv_sec - start.tv_sec);
    elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
    printf("This machine calculated all %d prime numbers under %d in %lf seconds\n",totalPrimesFound, NUM_THREADS*THREAD_LOAD, elapsed);
    pthread_exit(NULL);
}

Notice how the entire CPU is used up - enter image description here

P.S. - If you increase no of threads then actual CPU usage go down (Try making no of threads = 20 .) as the system uses more time in context switching than actual computing.

By the way, my machine is not as beefy as @mystical (Accepted answer). But my version with basic POSIX threading works way faster than OMP one. Here is the result -

enter image description here

P.S. Increase threadload to 2.5 million to see CPU usage , as it completes in less than a second.

sapy
  • 8,952
  • 7
  • 49
  • 60
1

Linux (and other Unix systems) report CPU usage such that a single-threaded process keeping a single CPU core busy is reported as 100%. This is what you see on your Ubuntu system: the program is always ready to run, and actually running on one core or the other.

On a machine with N cores, with all cores busy Linux/Unix show that as N * 100%, or a load average of N. (Load average also includes tasks waiting for I/O, or for a CPU to run on, so it can be higher than the number of CPU cores).

e.g. 800% load on a machine with 8 logical cores, whether that's across 4 physical cores with hyperthreading or 8 separate physical cores.


On a Windows machine, all-cores-busy is reported as 100% load. And the utilization of a single-threaded program maxing out one core is 100% / N, highly inconvenient on a many core machine for seeing if a single-threaded program spends any of its time sleeping or waiting for I/O.


However your OS reports it, to max out all the cores at once, you need to either run N processes, or have a process that starts N threads, or some combination of those.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
0

Simply try to Zip and Unzip a big file , nothing as a heavy I/o operations can use cpu.

Nima Mohammadi
  • 353
  • 2
  • 9
0

Try to parallelize your program using, e.g., OpenMP. It is a very simple and effective framework for making up parallel programs.

mikithskegg
  • 806
  • 6
  • 10
0

For a quick improvement on one core, remove system calls to reduce context-switching. Remove these lines:

system("clear");
printf("%d prime numbers calculated\n",primes);

The first is particularly bad, as it will spawn a new process every iteration.

Joel
  • 11,431
  • 17
  • 62
  • 72