
I have a handful of Python scripts, each of which makes heavy use of sorting, uniq-ing, counting, gzipping and gunzipping, and awking. As a first pass, I've used subprocess.call with shell=True (yes, I know about the security risks, which is why I call it a first pass). I have a little helper function:

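import sys
from datetime import datetime
from subprocess import call

def do(command):
    start = datetime.now()
    return_code = call(command, shell=True)
    # The elapsed time is a timedelta (H:MM:SS.ffffff), not milliseconds
    print('Completed in %s, return code = %d' % (datetime.now() - start, return_code))
    if return_code != 0:
        print('Failure: aborting with return code %d' % return_code)
        sys.exit(return_code)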

Scripts make use of this helper as in the following snippets:

do('gunzip -c %s | %s | sort -u | %s > %s' % (input, parse, flatten, output))
do("gunzip -c %s | grep 'en$' | cut -f1,2,4 -d\|| %s > %s" % (input, parse, output))
do('cat %s | %s | gzip -c > %s' % (input, dedupe, output))
do("awk -F ' ' '{print $%d,$%d}' %s | sort -u | %s | gzip -c > %s" % params)
do('gunzip -c %s | %s | gzip -c > %s' % (input, parse, output))
do('gunzip -c %s | %s > %s' % (input, parse, collection))
do('%s < %s >> %s' % (parse, supplement, collection))
do('cat %s %s | sort -k 2 | %s | gzip -c > %s' % (source, other_source, match, output))

And there are many more like these, some with even longer pipelines.

One issue I notice is that when a command early in a pipeline fails, the pipeline as a whole still reports exit status 0 as long as the last command succeeds. In bash I fix this with

set -o pipefail

but I do not see how this can be done in Python. I suppose I could put in an explicit call to bash but that seems wrong. Is it?
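To make the masking concrete, here is a minimal sketch. The `false | true` and `true | false` pipelines are stand-ins, not any of the real commands above; on POSIX systems `shell=True` runs the string through `/bin/sh`, where a pipeline's status is that of its last command.

```python
import subprocess

# `false` fails, but the pipeline's status is taken from `true`,
# so the failure is masked.
masked = subprocess.call('false | true', shell=True)

# Only a failure in the last command of the pipeline is visible.
visible = subprocess.call('true | false', shell=True)

print('false | true  ->', masked)
print('true | false  ->', visible)
```
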

In lieu of an answer to that specific question, I'd love to hear alternatives to implementing this kind of code in pure Python without resorting to shell=True. But when I attempt to use Popen and stdout=PIPE the code size blows up. There is something nice about writing pipelines on one line as a string, but if anyone knows an elegant multiline "proper and secure" way to do this in Python I would love to hear it!

An aside: none of these scripts ever take user input; they run batch jobs on a machine with a known shell which is why I actually ventured into the evil shell=True just to see how things would look. And they do look pretty easy to read and the code seems so concise! How does one remove the shell=True and run these long pipelines in raw Python while still getting the advantages of aborting the process if an early component fails?

Ray Toal
  • Why not create a single bash script that does what you want, and call that script from Python? Then you control the whole pipeline thing better. – Floris Feb 13 '14 at 00:00
  • Or better yet, either just make a pure Bash script, or convert all the shell external calls to native Python – DopeGhoti Feb 13 '14 at 00:06
  • Ah, the calls to `do` are only part of much larger Python scripts. There's too much logic in them (around the subprocess calls) to use bash, which is great for the pipelines but poor when dealing with arrays and conditional logic. – Ray Toal Feb 13 '14 at 00:14

1 Answer


You can set pipefail in the command string that you pass to an explicit bash call:

def do(command):
  start = datetime.now()
  return_code = call([ '/bin/bash', '-c', 'set -o pipefail; ' + command ])
  ...

Or, as @RayToal pointed out in a comment, use the -o option of the shell to set this flag: call([ '/bin/bash', '-o', 'pipefail', '-c', command ]).
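As a rough illustration of the second form, the sketch below wraps it in a function that returns the exit status. The name `run_pipefail`, the `bash` parameter, and the file name are hypothetical and only for illustration. It also passes the interpolated path through `shlex.quote` (Python 3; `pipes.quote` in Python 2), on the assumption that paths might contain spaces or shell metacharacters.

```python
import shlex
import subprocess

def run_pipefail(command, bash='/bin/bash'):
    """Run a pipeline string under bash with pipefail; return its exit status."""
    return subprocess.call([bash, '-o', 'pipefail', '-c', command])

# Hypothetical input path that does not exist, so `cat` fails.
# shlex.quote keeps the space in the name from splitting the argument.
path = 'no such file.txt'
status = run_pipefail('cat %s 2>/dev/null | sort -u > /dev/null' % shlex.quote(path))

# With pipefail the failure of `cat` is reported even though `sort` succeeded.
print('exit status:', status)
```

Quoting protects individual arguments only; the pipeline string as a whole is still executed by a shell and has to be trusted.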

Alfe
  • Thanks, nice. I ended up using `call([ '/bin/bash', '-o', 'pipefail', '-c', command ])`. – Ray Toal Feb 20 '14 at 01:33
  • The idea was based completely on your answer, though. :) I had never thought about calling bash. It's obvious now. – Ray Toal Feb 21 '14 at 01:37
  • This answer totally defeats the safety provided by shell=False. I would rather use shell=True in this case. – David Rissato Cruz Jul 25 '19 at 16:28
  • @DavidRissatoCruz Yeah, well, the OP explicitly asked about executing pipes provided as strings. You _need_ a shell for this, either the implicit one you get using `shell=True` or the explicit one you get with my answer. In both cases you will execute a string, so you need to _trust_ that string. But I don't see how `shell=True` could help here unless you mean that it makes the trust issue more obvious. – Alfe Aug 08 '19 at 15:16
  • I never meant to say your solution doesn't work; please don't take it that way. However, he asked for "alternatives to implementing this kind of code in pure Python without resorting to shell=True" ... "proper and secure way", and my point is that calling `bash -c` is as unsafe as using `shell=True`. From the security perspective, both solutions are exactly the same. – David Rissato Cruz Aug 11 '19 at 03:30
  • @DavidRissatoCruz You are right. The problem is of the "wash me but don't make me wet" kind. The OP states that pipes given as strings are kind of "nice" but on the other hand wants a "secure" solution, and one without resorting to "shell=True". All three are not achievable in one solution at the same time, IMHO. But one can argue that fixed strings, as long as I provide them completely myself, are not really insecure. They only become insecure if they contain parts provided by the input. The only other option would be to let go of strings completely for specifying the pipes. – Alfe Aug 13 '19 at 07:44
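For the last option mentioned above (giving up strings for specifying the pipes), one possible shape is sketched below. It is not taken from the answer: `run_pipeline` is a hypothetical helper that chains `subprocess.Popen` objects and returns every stage's exit code, so the caller can abort if any stage failed, much like pipefail. It assumes the final `stdout` is a file object or `None`, not `subprocess.PIPE`.

```python
import subprocess

def run_pipeline(commands, stdin=None, stdout=None):
    """Run a list of argv lists as a pipeline without a shell.

    Returns the exit codes of all stages, in order.
    """
    procs = []
    prev = stdin
    for i, argv in enumerate(commands):
        last = (i == len(commands) - 1)
        proc = subprocess.Popen(argv, stdin=prev,
                                stdout=stdout if last else subprocess.PIPE)
        if procs:
            # Close the parent's copy of the upstream pipe so the upstream
            # stage gets SIGPIPE if this stage exits early.
            procs[-1].stdout.close()
        prev = proc.stdout
        procs.append(proc)
    return [p.wait() for p in procs]

# Stand-in for a pipeline such as `... | sort -u | ...`; no shell is involved.
codes = run_pipeline([['printf', 'b\\na\\nb\\n'], ['sort', '-u'], ['wc', '-l']])
failed = any(code != 0 for code in codes)
print('exit codes:', codes, 'failed:', failed)
```

Each stage is an argument list, so nothing is parsed by a shell, at the cost of the one-line readability the question values. Redirections such as `> output` become the `stdout` file object.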