-2

I am using this program written in C to determine the permutations of size 10 of an regular alphabet. When I run the program it only uses 36% of my 3GHz CPU leaving 50% free. It also only uses 7MB of my 8GB of RAM. I would like to use at least 70-80% of my computer's performance and not just this misery. This limitation is making this procedure very time consuming and I don't know when (number of days) I will need to have the complete output. I need help to resolve this issue in the shortest possible time, whether improving the source code or other possibilities.

Any help is welcome even if this solution goes through instead of using the C language use another one that gives me better performance in the execution of the program.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static int count = 0;

void print_permutations(char arr[], char prefix[], int n, int k) {
    int i, j, l = strlen(prefix);
    char newprefix[l + 2]; 

    if (k == 0) {
       printf("%d %s\n", ++count, prefix);
       return;
    }
    for (i = 0; i < n; i++) {
        //Concatenation of currentPrefix + arr[i] = newPrefix
        for (j = 0; j < l; j++)   
            newprefix[j] = prefix[j];
        newprefix[l] = arr[i];
        newprefix[l + 1] = '\0'; 

        print_permutations(arr, newprefix, n, k - 1); 
    }
}

int main() {            
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";

    print_permutations(arr, "", n, k);  

    system("pause");
    return 0;        
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • 3
    Without threads, which may or may not be warranted, your program is going to be limited to a single core. That is, if you have a 6-core Intel Core i7 with hyperthreading, resulting in 12 virtual cores. A thread that is always running (never blocks) will consume an entire core, but that's only going to be 8.33% of all CPU as per Windows Task Manager. – selbie Mar 28 '20 at 19:02
  • What exactly are you asking: "Why is this program slow" or "Why doesn't this program use lots of CPU%"? – alex berne Mar 28 '20 at 19:03
  • 2
    As a programmer, you don't really want to strive to use all available CPU. You start by having the most efficient algorithm, then look to parallelize it across multiple cores. – selbie Mar 28 '20 at 19:03
  • 2
    For your program, a good chunk of your program's runtime is blocked on the I/O call of the printf statement itself. If you remove the printf statement, your code will probably run much faster - and you'll see an observable boost in CPU usage. – selbie Mar 28 '20 at 19:05
  • 3
    The program will spent most of its time in the `printf` call to output the results. There is nothing to improve here if you need the results printed to the screen. – walnut Mar 28 '20 at 19:06
  • 1
    " It also only uses 7MB of my 8GB of RAM" What makes you think this program should use more memory? – Support Ukraine Mar 28 '20 at 19:06
  • Which platform? How do you measure CPU usage? – Support Ukraine Mar 28 '20 at 19:07
  • The final optimization to your code is to remove that tail recursion. – selbie Mar 28 '20 at 19:07
  • @4386427 - Let's say two salesmen approached you with separate implementations of a program to sell you. Both implementations meet the requirements and you are choosing which one to buy. The first salesman says, "look, our code is constantly using all available CPU as seen in the task manager because our programmers know how to take advantage of all the power". – selbie Mar 28 '20 at 19:35
  • 1
    @selbie No, tail recursion is perfectly fine, as long as you are using a decent compiler. The problem with this code is that it *doesn't have* tail recursion. – EOF Mar 28 '20 at 19:35
  • @4386427 - The second salesperson says, "our program doesn't use much CPU, but can complete more operations and transactions per second than the other guy's version. Thereby enabling you to use a cheaper server, run a heavier load, or have other services using the same physical box." I don't know about you, but I'm going with the #2 guy - because he can always scale out further with a few tweaks. The first guy has already hit a limit that requires him to rethink his architecture. – selbie Mar 28 '20 at 19:35
  • @EOF - I'm not so sure. Look at gcc 9.3 with -O2 optimization. It still appears to have a recursive call: https://www.godbolt.org/z/CYUZ3i I'm willing to be educated on this if you think I'm misinterpreting the compiler behavior. – selbie Mar 28 '20 at 19:44
  • 1
    @selbie There is no tailrecursion in the OPs code. *None* of the function calls are in tail position. – EOF Mar 28 '20 at 19:53
  • @4386427 - I'm out. – selbie Mar 28 '20 at 20:02
  • Okay, thanks for the answers. I will then create 26 Asynchronous Threads. One for each letter of the alphabet. One thread to obtain axxxxxxxxx strings, another bxxxxxxxxx and so on until zxxxxxxxxx. I hope this works and that I can get more CPU and speed to get the permutations. If anyone has anything else to add about this idea of mine I am grateful. – user3903094 Mar 28 '20 at 20:34
  • @user3903094 I'm pretty sure that won't help you. I just ran the program on my pc (with a much more sensible k=4), and I can confirm that printf is about 95% of your runtime. If you can, avoid printing the permutation or piping your results to a file instead. If printf is your problem like it is for me, increasing the number of threads will increase your CPU-usage, but not speed up the calculation. – alex berne Mar 28 '20 at 22:03
  • alex berne, I already tried to comment the printf statement and the result was not better. The program peaks 30% of the CPU without printf and 38% with printf. I have an Intel i5 and the windows 10 operating system. If a computer has free processing that could be operating to obtain a faster solution, there must be a way to wake it up. Worse is that I am in a situation where I cannot even estimate how long I need to have the computer on to get all the permutations. Yes I will need to write the permutations to a file, I will not need them to view them on the console. – user3903094 Mar 28 '20 at 22:28
  • @user3903094 Why don't you just estimate how long this will take? There are `26!/(26-10)!` permutations, which is `~1.9275e13`. If you have a ~1GHz processor and printing a single permutation takes ~1000 processor cycles (almost all for the `printf()`, substantially because `stdout` is line-buffered and your output contains a newline), you need about ~223 days to print all the permutations. – EOF Mar 29 '20 at 08:05
  • @user3903094 - Why don't you look at the time for a smaller example like k=4 instead of looking at the CPU%? While the CPU% stays the same for both versions, to a file (via ` >file` in linux bash) and to stdout, the time changes from 3.2s (stdout) to 0.1s (file) on my PC. – alex berne Mar 29 '20 at 08:46

2 Answers2

1

There are multiple reasons for your bad experience:

  1. Your metric: Your metric is fundamentally flawed. Peak-CPU% is an imprecise measurement for "how much work does my CPU do". Which normally isn't really what you're most interested in. You can inflate this number my doing more work (like starting another thread that doesn't contribute to the output at all). Your proper metric would be items per second: How many different strings will be printed or written to a file per second. To measure that, start a test run with a smaller size (like k=4), and measure how long it takes.

  2. Your problem: Your problem is hard. Printing or writing down all 26^10 ~1.4e+14 different words with exactly 10 letters will take some time. Even if you changed it to all permutations - which your program doesn't do - it's still ~1.9e13. The resulting file will be 1.4 petabytes - which is most likely more than your hard drive will accept. Also, if you used your CPU to 100% and used one thousand cycles for one word, it'd take 1.5 years. 1000 cycles are an upper bound, you most likely won't be faster that this while still printing your result, as printf usually takes around 1000 cycles to complete.

  3. Your output: Writing to stdout is slow comapred to writing to a file, see https://stackoverflow.com/a/14574238/4838547.

  4. Your program: There are issues with your program that could be a problem for your performance. However, they are dominated by the other problems stated here. With my setup, this program uses 93.6% of its runtime in printf. Therefore, optimizing this code won't yield satisfying results.
alex berne
  • 699
  • 5
  • 17
1

There are fundamental problems with your approach:

  • What are you trying to achieve?

    If you want to enumerate the permutations of size 10 of a regular alphabet, your program is flawed as it enumerates all combinations of 10 letters from the alphabet. Your program will produce 2610 combinations, a huge number, 141167095653376, 141,167 billion! Ignoring the numbering, which will exceed the range of type int, that's more than 1.5 Petabytes, unlikely to fit on your storage space. Writing this at the top speed of 100MB/s would take more than 20 days.

    The number of permutations, that is combinations of distinct letters from the 26 letter alphabet is not quite as large: 26! / 16! which is still large: 19275223968000, 7 times less than the previous result. That is still more than 212 terabytes of storage and 3 days at 100MB/s.

    Storing these permutations is therefore impractical. You could change your program to just count the permutations and measure how long it takes if the count is the expected value. The first step of course is to correct your program to produce the correct set.

  • Test on smaller sets to verify correctness

    Given the expected size of the problem, you should first test for smaller values such as enumerating permutations of 1, 2 and 3 letters to verify that you get the expected number of results.

  • Once you have correctness, only then focus on performance

    Selecting different output methods, from printf("%d %s\n", ++count, prefix); to ++count; puts(prefix); to just ++count;, you will see that most of the time is spent in producing the output. Once you stop producing output, you might see that strlen() consumes a significant fraction of the execution time, which is useless since you can pass the prefix length from the caller. Further improvements may come from using a common array for the current prefix, precluding the need to copy at each recursive step.

    Using multiple threads each producing its own output, for example each with a different initial letter, will not improve the overall time as the bottleneck is the bandwidth of the output device. But if you reduce the program to just enumerate and count the permutations, you might get faster execution with multiple threads, one per core, thereby increasing the CPU usage. But this should be the last step in your development.

  • Memory use is no measure of performance

    Using as much memory as possible is not a goal in itself. Some problems may require a tradeoff between memory and time, where faster solving times are achieved using more core memory, but this one does not. 8MB is actually much more than your program's actual needs: this count includes the full stack space assigned to the program, of which only a tiny fraction will be used.

    As a matter of fact, using less memory may improve overall performance as the CPU will make better use of its different caches.

Here is a modified program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long count;

void print_permutations(char arr[], int n, char used[], char prefix[], int pos, int k) {
    if (pos == k) {
        prefix[k] = '\0';
        ++count;
        //printf("%llu %s\n", count, prefix);
        //puts(prefix);
        return;
    }
    for (int i = 0; i < n; i++) {
        if (!used[i]) {
            used[i] = 1;
            prefix[pos] = arr[i];
            print_permutations(arr, n, used, prefix, pos + 1, k);
            used[i] = 0;
        }
    }
}

int main(int argc, char *argv[]) {
    int n = 26, k = 10;
    char arr[27] = "abcdefghijklmnopqrstuvwxyz";
    char used[27] = { 0 };
    char perm[27];
    unsigned long long expected_count;
    clock_t start, elapsed;

    if (argc >= 2)
        k = strtol(argv[1], NULL, 0);
    if (argc >= 3)
        n = strtol(argv[2], NULL, 0);

    start = clock();
    print_permutations(arr, n, used, perm, 0, k);
    elapsed = clock() - start;
    expected_count = 1;
    for (int i = n; i > n - k; i--)
        expected_count *= i;
    printf("%llu permutations, expected %llu, %.0f permutations per second\n",
           count, expected_count, count / ((double)elapsed / CLOCKS_PER_SEC));
    return 0;
}

Without output, this program enumerates 140 million combinations per second on my slow laptop, it would take 1.5 days to enumerate the 19275223968000 10-letter permutations from the 26-letter alphabet. It uses almost 100% of a single core, but the CPU is still 63% idle as I have a dual core hyper-threaded Intel Core i5 CPU. Using multiple threads should yield increased performance, but the program must be changed to no longer use a global variable count.

Community
  • 1
  • 1
chqrlie
  • 131,814
  • 10
  • 121
  • 189