I'm trying to do the right thing by porting a Python script that invokes a number of shell command lines via `subprocess.call(... | ... | ..., shell=True)` to one that avoids the security risk of `shell=True` by using `Popen`. So I have written a little sample script to try things out. It executes the command line

```
awk '{print $1 " - " $2}' < scores.txt | sort | python uppercase.py > teams.txt
```

as follows:
```python
from subprocess import Popen, PIPE

with open('teams.txt', 'w') as destination:
    with open('scores.txt', 'r') as source:
        p3 = Popen(['python', 'uppercase.py'], stdin=PIPE, stdout=destination)
        p2 = Popen(['sort'], stdin=PIPE, stdout=p3.stdin)
        p1 = Popen(['awk', '{print $1 " - " $2}'], stdin=source, stdout=p2.stdin)
        p1.communicate()
```
This program works with a small data set.
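For reference, here is a minimal sketch of the chaining pattern I've seen in the `subprocess` docs, where each stage's `stdout=PIPE` feeds the next stage's `stdin` and the parent closes its own copy of each intermediate pipe. The two `python -c` stages are hypothetical stand-ins for `awk` and `sort`:

```python
# Sketch: chain processes via OS pipes so data streams between them
# without being buffered in the parent. The `python -c` stages are
# stand-ins for awk and sort.
import sys
from subprocess import Popen, PIPE

upper = Popen([sys.executable, '-c',
               'import sys\n'
               'sys.stdout.writelines(l.upper() for l in sys.stdin)'],
              stdin=PIPE, stdout=PIPE, text=True)
reverse = Popen([sys.executable, '-c',
                 'import sys\n'
                 "sys.stdout.writelines(l.rstrip('\\n')[::-1] + '\\n'"
                 ' for l in sys.stdin)'],
                stdin=upper.stdout, stdout=PIPE, text=True)
upper.stdout.close()  # parent drops its handle so EOF can propagate

upper.stdin.write('abc\n')
upper.stdin.close()             # EOF flows through the whole pipeline
out, _ = reverse.communicate()  # only the final stage's output is read
print(out)                      # CBA
```

Closing `upper.stdout` in the parent matters: once the first stage exits, the second stage then sees end-of-file instead of blocking forever on a pipe the parent still holds open.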
Now I was struck by the following line from the documentation of the `communicate` method:

> **Note:** The data read is buffered in memory, so do not use this method if the data size is large or unlimited.
What? But I have huge files that need to be awk'd and sorted, among other things. The reason I tried to use `communicate` in the first place is that I saw this warning for `subprocess.call`:

> **Note:** Do not use stdout=PIPE or stderr=PIPE with this function as that can deadlock based on the child process output volume. Use Popen with the communicate() method when you need pipes.
I'm really confused. It seems my choices are:

- use `call` with `shell=True` (security risk, they say)
- use `PIPE` with `call` (but then risk deadlock)
- use `Popen` and `communicate` (but my data is too large, 100s of megabytes).
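My tentative understanding, which may be wrong, is that `communicate()` only buffers streams the parent itself reads through `PIPE`; if the last stage writes straight to a file, nothing accumulates in the parent and a plain `wait()` suffices. A sketch of that idea, with hypothetical temp-file paths and `python -c` stages in place of `awk` and `sort`:

```python
# Sketch: the last stage writes directly to a file, so the parent never
# reads any PIPE and nothing is buffered in its memory. Hypothetical
# file names; `python -c` stages stand in for awk and sort.
import os
import sys
import tempfile
from subprocess import Popen, PIPE

tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, 'scores.txt')
dst_path = os.path.join(tmpdir, 'teams.txt')
with open(src_path, 'w') as f:
    f.write('banana\napple\n')

with open(src_path) as src, open(dst_path, 'w') as dst:
    upper = Popen([sys.executable, '-c',
                   'import sys\n'
                   'sys.stdout.writelines(l.upper() for l in sys.stdin)'],
                  stdin=src, stdout=PIPE, text=True)
    sort = Popen([sys.executable, '-c',
                  'import sys\n'
                  'sys.stdout.writelines(sorted(sys.stdin))'],
                 stdin=upper.stdout, stdout=dst, text=True)
    upper.stdout.close()  # parent drops its handle on the pipe
    sort.wait()           # data streams entirely between the children
```

Is that reading correct, or does a multi-stage `Popen` pipeline like this still need `communicate()` somewhere?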
What am I missing? How do I create a multi-process pipeline in Python for very large files without `shell=True`, or is `shell=True` acceptable?