2

I'm trying to do the right thing by porting a Python script that invokes a number of shell command lines via

subprocess.call(... | ... | ... , shell=True) 

to one that avoid the security risk of shell=True by using Popen. So I have written a little sample script to try things out. It executes the command line

awk '{print $1 " - " $2}' < scores.txt | sort | python uppercase.py > teams.txt

as follows:

with open('teams.txt', 'w') as destination:
    with open('scores.txt', 'r') as source:
        p3 = Popen(['python', 'uppercase.py'], stdin=PIPE, stdout=destination)
        p2 = Popen(['sort'], stdin=PIPE, stdout=p3.stdin)
        p1 = Popen(['awk', '{print $1 " - " $2}'], stdin=source, stdout=p2.stdin)
        p1.communicate()

This program works with a small data set.

Now I was struck by the following line from the documentation of the communicate method:

Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited.

What? But I have huge files that need to be awk'd and sorted, among other things. The reason I tried to use communicate in the first place is that I saw this warning for subprocess.call:

Note Do not use stdout=PIPE or stderr=PIPE with this function as that can deadlock based on the child process output volume. Use Popen with the communicate() method when you need pipes.

I'm really confused. It seems my choices are:

  • use call with shell=True (security risk, they say)
  • use PIPE with call (but then risk deadlock)
  • use Popen and communicate (but my data is too large, 100s of megabytes).

What am I missing? How do I create a several process pipeline in Python for very large files without shell=True, or is shell=True acceptable?

Ray Toal
  • 86,166
  • 18
  • 182
  • 232
  • I'd love to see other responses to this. In my experience, ya, you're stuck. I haven't found a good way around this. The only way I've found is to replace `sort`, `awk`, etc, with Python code in your script. However, I'm not familiar with this type of "popen chaining" either. – mtik00 Sep 11 '14 at 04:28
  • See JF Sebastian's answer below. It would appear there is indeed a good way around this. – Ray Toal Sep 13 '14 at 06:39
  • Thanks Ray. Not quite what I was hoping for (3rd party), but plumbum looks like a very interesting dependency! – mtik00 Sep 14 '14 at 02:05

1 Answers1

4

The note about "buffered in memory" only makes sense if you use something like stdout=PIPE. It doesn't apply to stdout=file (os.dup2() does the redirecting at OS file descriptor level, there is nothing to do for .communicate() method).

Don't use call and PIPE. call() is (simplified) just Popen().wait() i.e., it does not read from the pipe. Do not use PIPE unless you read from (write to) the pipe (there is no point).

In your code p1.communicate() doesn't read any data. You could replace it with p1.wait(). Your code is missing p3.stdin.close(); ... ; p2.stdin.close(); ... ; p3.wait(), p2.wait()

Otherwise, the code works for large files.

On shell=True

If the command is hardcoded (as in your question) then there is no security risk. If the command may come from an untrusted source then it doesn't matter how do you run this command (the untrusted source may run whatever it likes in this case). If only some arguments come from an untrusted source then you could use plumbum module to avoid reimplementing the pipeline yourself:

from plumbum.cmd import awk, sort, python

(awk['{print $1 " - " $2}'] < untrusted_filename | sort |
 python["uppercase.py"] > "teams.txt")()

See How do I use subprocess.Popen to connect multiple processes by pipes?

Community
  • 1
  • 1
jfs
  • 399,953
  • 195
  • 994
  • 1,670
  • Ah, I was not aware of plumbum! The filenames are indeed untrusted in the actual application (they are passed in as names of "client feeds", let's say) and were only hardcoded for my sample, so plumbum looks nice. I was worried about the output of p2 being a pipe (I think) as it was p3's stdin, and the large number of closes and waits made me feel there should be a better way. From a first look at plumbum I am thinking "plumbum is to subprocess piping as requests is to urllib2" :) – Ray Toal Sep 13 '14 at 06:37
  • @RayToal: if filenames are untrusted then consider what happens if `/dev/zero`, `/etc/passwd` are given. – jfs Sep 13 '14 at 12:57