3

I have a very long one-line shell command to be called by Python. The codes are like this:

# "first way"
def run_cmd ( command ):
    print "Run: %s" % command
    subprocess.call (command, shell=True)
run_cmd('''sort -n -r -k5 {3} |head -n 500|awk 'OFS="\t"{{if($2-{1}>0){{print $1,$2-{1},$3+{1},$4,$5}}}}' > {2}'''.format(top_count,extend/2,mid,summit))

These codes works, but it always complains like this:

sort: write failed: standard output: Broken pipe
sort: write error
awk: (FILENAME=- FNR=132) fatal: print to "standard output" failed (Broken pipe)

According to a previous answer, I need to use a longer script to finish this, like:

# "second way"
p1 = Popen("sort -n -r -k5 %s"%summit, stdout=PIPE)
p2 = Popen("head -n 500", stdin=p1.stdout, stdout=PIPE)
# and so on ..........

My questions are:

(1) whether the "second way" will be slower than "first way"

(2) if I have to write in "first way" anyway (because it's faster to write), how can I avoid the complain like broken pipe

(3) what might be the most compelling reason that I shouldn't write in "first way"

Community
  • 1
  • 1
Hanfei Sun
  • 45,281
  • 39
  • 129
  • 237
  • 2
    The "first way" is almost unreadable. So that is a good reason not to use it. – ebarr Nov 21 '12 at 10:41
  • 1
    @ebarr It's unreadable because of the awk part... – Hanfei Sun Nov 21 '12 at 10:42
  • BTW, in the 2nd way of doing it, it might be desirable to close all intermediate file descriptors which are not used any longer by the main program: `p1.stdout.close()` after creating `p2`, and so on. THis ensures that process 1 gets the EOF condition created by process 2. – glglgl Oct 27 '14 at 14:05

3 Answers3

5

Using shell = True can be a security risk if your input data comes from an untrusted source. E.g. what if the content of your mid variable is "/dev/null; rm -rf /". This does not seem to be the case in your scenario, so I would not worry too much about it.

In your code you write the result of awk directly to the filename in mid. To debug the problem, you might want to use subprocess.check_output and read the result from your awk invocation in your python program.

cmd = """sort -n -r -k5 %s |
      head -n 500|
      awk 'OFS="\t"{{if($2-{1}>0){{print $1,$2-{1},$3+{1},$4,$5}}}}'""".format(summit, top_count)

subprocess.check_call(cmd, shell=True, stdout=file)
Hans Then
  • 10,935
  • 3
  • 32
  • 51
  • (1) if you don't use `format()` then you should unescape `{` in `cmd` and add `%` directives, variables. Otherwise the command is broken. (2) `line` is a single character, not line in your code. `check_output()` returns a *single* string. You don't want to iterate one character at a time. To redirect subprocess' stdout to a file: `check_call(cmd, shell=True, stdout=file)` – jfs Sep 10 '14 at 04:22
2

(1) whether the "second way" will be slower than "first way"

Starting a new process is an expensive operation therefore there should not be a large difference between allowing the shell to parse the command line and start child processes and doing it yourself in Python. The only benchmark that matters is your code on your hardware. Measure it.

(2) if I have to write in "first way" anyway (because it's faster to write), how can I avoid the complain like broken pipe

The first "broken pipe" might be similar to: 'yes' reporting error with subprocess communicate(). Try the workaround I've provided there.

The second broken pipe you could fix by redirecting the pipeline stdout to the mid file:

with open(mid, 'wb') as file:
    check_call(pipeline, shell=True, stdout=file)

It implements > {2} in your command without the shell.

(3) what might be the most compelling reason that I shouldn't write in "first way"

if any of top_count, extend, mid, summit come from a source that is not completely under your control then you risk running an arbitrary command under your user.


plumbum module provides both security and readability (measure time performance if it is important for you in this case):

from plumbum.cmd import awk, head, sort

awk_cmd = 'OFS="\t"{if($2-%s>0){print $1,$2-%s,$3+%s,$4,$5}}' % (extend/2,)*3
(sort["-n", "-r", "-k5", summit] | head["-n", "500"] | awk[awk_cmd] > mid)()

See, How do I use subprocess.Popen to connect multiple processes by pipes?

Community
  • 1
  • 1
jfs
  • 399,953
  • 195
  • 994
  • 1,670
1

It is unlikely to be any slower, but you can always test it with timeit to be sure. There are two good reasons not to do it the first way. The first is that while it may be marginally faster to type the first time, readability is greatly reduced, and Readability Counts. The second is that using shell=True is a huge security risk, and should be avoided as a matter of principal.

aquavitae
  • 17,414
  • 11
  • 63
  • 106