I am using a Python script to handle part of a C++ program, but the process is going extremely slowly.

I have a basic function for running shell commands and retrieving the output:

std::string ShellCommand::RunShellCommand(std::string cmd) {
    char buffer[128];
    std::string Response = "";
    FILE* pipe = popen(cmd.c_str(), "r");
    if (pipe == NULL)
        return Response; // popen failed, so there is no output to read
    while (!feof(pipe)) {
       // read the next chunk into buffer and append it to the result
       if (fgets(buffer, sizeof(buffer), pipe) != NULL)
          Response += buffer;
    }
    pclose(pipe);
    return Response;
}

I use this command in a loop:

void loop(int data) {
    while (HashSize(HashGenerator(data)) > 25) {
        data += 1;
    }
}

HashSize() takes the hash string and counts its leading zero bits: it iterates over the characters until it finds a non-zero one, adding up the binary zeros along the way. The HashGenerator() function uses the generic shell command function above to run a Python script:

import hashlib
import sys

# Get the data from argv
Data = sys.argv[1]

# Generate the hash
Hash = hashlib.sha256()
Hash.update(Data.encode("utf-8"))  # hashlib needs bytes, not str, on Python 3

# Output the generated hash
print(Hash.hexdigest())

Obviously, the data for the script is taken from the shell command.
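For reference, the leading-zero count described for HashSize() can be sketched in Python as below. This is only an illustration of the idea, not the original C++ implementation: the function name `leading_zero_bits` and the digit-by-digit approach are assumptions.

```python
import hashlib


def leading_zero_bits(hex_digest: str) -> int:
    """Count the leading zero bits of a hexadecimal digest string."""
    bits = 0
    for ch in hex_digest:
        nibble = int(ch, 16)
        if nibble == 0:
            bits += 4  # a '0' hex digit contributes four zero bits
            continue
        # First non-zero digit: add the zero bits above its highest set bit, then stop
        bits += 4 - nibble.bit_length()
        break
    return bits


if __name__ == "__main__":
    digest = hashlib.sha256(b"0").hexdigest()
    print(digest, leading_zero_bits(digest))
```

Working on hex digits four bits at a time avoids converting the whole 256-bit digest to a binary string first.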

This all works fine, but is very slow (data is incremented from 0 to roughly 18,000 in a few hours). My actual implementation is a bit more complicated, but I suspect the system call to Python is what is causing the problems here.

What is the overhead for this process? Do you agree that this is where my problem is? Is there a way I can speed up this procedure?

I want this question to be useful for others, so I want to avoid being overly hardware specific; however, I should mention that this is running on a Raspberry Pi Zero W, which isn't known for breaking computational records. I would still expect it to go faster than this, though.

eric
  • How long does the Python script usually take if not run via `popen(...)`? – Ilian Zapryanov Jul 20 '20 at 20:12
  • [Why `while(!feof(file))` is always wrong](https://stackoverflow.com/questions/5431941/while-feof-file-is-always-wrong) – Barmar Jul 20 '20 at 20:18
  • @Ilian Zapryanov I don't have an exact figure, but it is much faster. If I do the loop in python, it is about 10x the speed. – eric Jul 20 '20 at 20:19
  • Are you running the script multiple times, or just once? Every time you call `popen()` it has to create a shell process, it parses the command line, that starts a new python process, it parses the script. – Barmar Jul 20 '20 at 20:21
  • You certainly can't expect the answer to this question to be the same whether the hardware in question is, for example, a Raspberry Pi or a 128-core Threadripper with 128GB of DDR RAM. It should be obvious that the numbers are completely different on each hardware, and a useful answer for your case can only be obtained by benchmarking this on your hardware. – Sam Varshavchik Jul 20 '20 at 20:21
  • @SamVarshavchik He did say what his hardware is. – Barmar Jul 20 '20 at 20:22
  • Is there a reason why you're doing this in Python? Isn't there a crypto library for C++? – Barmar Jul 20 '20 at 20:23
  • @Barmar Yes, it is used in every iteration. Are you saying that it isn't Python so much as it is popen? – eric Jul 20 '20 at 20:25
  • There is some setup time whenever you execute python, on my system about .06 seconds which would be about 15 minutes over 18000 executions. You could run the program in the background and rewrite it to take commands from stdin, one line per command to reduce startup costs. – tdelaney Jul 20 '20 at 20:26
  • @Barmar in the full implementation, I end up sending this data to a server which will regenerate the hash based on the data to verify veracity. The server is using the exact same Python script, so I felt it would be wise to be consistent to prevent problems. I'm starting to think otherwise. – eric Jul 20 '20 at 20:26
  • @tdelaney I like what you're thinking, but it seems a bit inelegant. Surely there must be a cleaner solution. (I don't mean to disparage your comment, I appreciate the suggestion!) – eric Jul 20 '20 at 20:28
  • Did you compile your code with compiler optimizations enabled? If not, do so. – Jesper Juhl Jul 20 '20 at 20:29
  • Also, can you rewrite that logic in `python` with `subprocess` and `communicate`? Also, why do you wait on `feof()`? Try writing it in another Python script with `popen`, without waiting for `EOF`, and see if it's still slow. – Ilian Zapryanov Jul 20 '20 at 20:34
  • @eric - actually it's a pretty standard pipeline from the python script's point of view - that the calling program supplies the input and consumes the output is also reasonably common. – tdelaney Jul 20 '20 at 20:35
  • @JesperJuhl I will give that a try, thank you. The answer is probably a quick google away but do you know how to do this for G++ off hand? – eric Jul 20 '20 at 21:12
  • @IlianZapryanov I can give it a try! I've only used subprocess in Python with the caller being the master and the subprocess being the slave, I will have to do some research into how to invert that model. – eric Jul 20 '20 at 21:13

1 Answer

The overhead is the following:

  • popen() - It creates a pipe (low overhead) and starts a sh process (relatively expensive)
  • sh - Parses the command line (low overhead) and starts a python process (relatively expensive)
  • python - Parses and compiles the Python script (somewhat expensive) and executes the script.

The 2nd and 3rd steps are the same as if you run the Python script by hand from the terminal.
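To get a rough feel for these costs on a given machine, a small timing sketch like the following may help. It is an assumption-laden illustration, not a benchmark of the original program: the helper names are made up, and it spawns the interpreter directly, so it leaves out the extra `sh` process that popen() adds.

```python
import hashlib
import subprocess
import sys
import time


def time_spawned(n: int) -> float:
    """Seconds needed to start n Python interpreters that each do nothing."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_in_process(n: int) -> float:
    """Seconds needed to compute n SHA-256 digests inside this process."""
    start = time.perf_counter()
    for i in range(n):
        hashlib.sha256(str(i).encode("utf-8")).hexdigest()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 20
    print("spawn per call:  %.6f s" % (time_spawned(n) / n))
    print("hash per call:   %.6f s" % (time_in_process(n) / n))
```

Comparing the two per-call figures shows how much of each iteration goes to process startup rather than to hashing.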

Starting up lots of processes in a loop is generally not a good idea. If you can't do the calculation in C++, a better approach is to start the Python process once, set up bi-directional communication with it, and then send it each input and read back the result. Unfortunately, this is more complicated to code, because a popen() pipe is one-directional: it is opened either for reading or for writing.
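The line-per-request exchange can be sketched as follows, with Python on both sides for brevity. Everything here is hypothetical (the `HashWorker` class, the embedded `WORKER` source, the one-line-in, one-line-out protocol), and it assumes Python 3.7 or later.

```python
import hashlib
import subprocess
import sys

# Hypothetical worker: one input per line in, one hex digest per line out.
WORKER = r'''
import hashlib, sys
for line in sys.stdin:
    data = line.rstrip("\n")
    sys.stdout.write(hashlib.sha256(data.encode("utf-8")).hexdigest() + "\n")
    sys.stdout.flush()  # reply immediately instead of leaving it in a buffer
'''


class HashWorker:
    """Keeps one Python child process alive and exchanges lines with it."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def digest(self, data: str) -> str:
        # Send one request line, then block until the matching reply line arrives
        self.proc.stdin.write(data + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().strip()

    def close(self):
        self.proc.stdin.close()  # EOF ends the worker's read loop
        self.proc.wait()


if __name__ == "__main__":
    worker = HashWorker()
    for value in range(3):
        print(value, worker.digest(str(value)))
    worker.close()
```

The interpreter is started once, so each later request costs only a write and a read on the pipes. A C++ caller would need two pipes for the same exchange, for example via pipe(), fork() and one of the exec functions.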

Barmar
  • Great info, thank you for sharing. It seems my plan to implement the same code client side to match server side isn't worth the cost. I'll try to do this purely in C++. – eric Jul 20 '20 at 21:10