
This is my input file format:

@SRR2056440.1 1 length=100
TGTAGGTCTGAGCAGCTTGTCCTGGCTGTGTCCATGTCAGAGCAACGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCA
+SRR2056440.1 1 length=100
BCBFFFEFHHHHHJJJJJJIJJJJJJJJIJHHIJJIIJJJJJIJJIJJJJJJJJFHIJJJHHHHHHFDDDBDDD>>ACDEDDDDDDDDDDDDDDDDDEDD
@SRR2056440.2 2 length=100
CTGCCGCCACCGCAGCAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCGTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCCA
+SRR2056440.2 2 length=100
CCCFFFFFHHHHHJJJJJJJJJJJIJIJIGJGGIGGJIJJEHFEDDDDDDDDDDABDDDDDDDDDDDDDDADDDDDDDDDDDCDDDDDDBBDDCDDBDD@
@SRR2056440.3 3 length=100
TCTGCCGCCACCGCAGCAGCCACAGGCAGAGGAGGACGAGGACGACTGGGAATCGTAGGGGGCTCCATGACACCTTCCCCCCCAGACCCAGACTTGGGCC
+SRR2056440.3 3 length=100
CCCFFFFFHGHHHJJJJJIJJJJJJIJJIJJJIJJIIIGIJ<CDBCDDDDDDDDDDDDDDDDDDDDDDDDDDDDDCDDDDDDDDDDDDDDDDDDCDCBDD

This is the command I want to execute:

cat input.fq | awk 'NR%4==2{sum+=length($0);nr++;sumsq+=length($0)*length($0)}END{printf"%.1f\t%.1f\n",sum/nr,sqrt(sumsq/nr-(sum/nr)**2)}'

And the output of the command:

100.0 0.0

I want to execute that command inside a Python script using subprocess. I have made several attempts but can't figure it out. This is my latest try:

awk_comm = r"""'NR%4==2{sum+=length($0);nr++;sumsq+=length($0)*length($0)}END{printf"%.1f\t%.1f\n",sum/nr,sqrt(sumsq/nr-(sum/nr)**2)}'"""
cmd = ['cat', 'input.fq', '|', 'awk', awk_comm]
p2 = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
out1, err = p2.communicate()

EDIT:

I can't see any error in the output. It gets stuck, running forever.

cucurbit
  • BTW, `cat input.fq | ...` is bad practice even in shell -- it forces your `awk` to read a FIFO from `cat`, which is necessarily slower than just reading from the file direct; moreover, with a direct file handle you can reread, seek around, etc; but a FIFO can only be read once front-to-back. – Charles Duffy May 16 '17 at 15:31
  • Anyhow, when you pass an array with `shell=True`, the result is `subprocess.Popen(['sh', '-c']+yourarray, shell=False)`. That means that the only thing passed as source for the shell to parse is the **very first** element of that array. – Charles Duffy May 16 '17 at 15:37
  • BTW -- do see the warning in https://docs.python.org/2/library/subprocess.html#frequently-used-arguments before using `shell=True`. – Charles Duffy May 16 '17 at 19:50
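
To make the comment above concrete, here is a minimal sketch (POSIX only, Python 3.7+ assumed for `capture_output`) of how a list is treated under `shell=True`. The `echo` and `wc` commands are illustrative stand-ins, not part of the question.

```python
import subprocess

# With shell=True on POSIX, a list is run as ['/bin/sh', '-c'] + list.
# Only the first element is parsed as shell source; the remaining
# elements become the positional parameters $0, $1, ...
as_list = subprocess.run(
    ['echo "$0 $1"', 'a', 'b'],
    shell=True, capture_output=True, text=True)
list_output = as_list.stdout  # the script saw 'a' as $0 and 'b' as $1

# A pipeline has to be handed to the shell as one string.
as_string = subprocess.run(
    'printf "x\\ny\\n" | wc -l',
    shell=True, capture_output=True, text=True)
line_count = as_string.stdout.strip()
```

Applied to the question's list, the shell source is just `cat`, which has no file argument and therefore waits on standard input. That is consistent with the script appearing to run forever.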

4 Answers


The following works for me.

>>> awk_comm = r"""cat input.fq | awk 'NR%4==2{sum+=length($0);nr++;sumsq+=length($0)*length($0)}END{printf"%.1f\t%.1f\n",sum/nr,sqrt(sumsq/nr-(sum/nr)**2)}'"""
>>> p2 = subprocess.Popen(awk_comm, stdout=subprocess.PIPE,shell=True)
>>> res = p2.communicate()
>>> res
('100.0\t0.0\n', None)
Rolf of Saxony

There's no point to shell=True here. Just set up your subprocess.Popen object to do everything you'd otherwise use the shell for:

# the original awk code, with whitespace added for readability
awk_command = r"""
NR%4==2 {
  sum+=length($0);
  nr++;
  sumsq+=length($0)*length($0)
}
END {
  printf "%.1f\t%.1f\n", sum/nr, sqrt(sumsq/nr-(sum/nr)**2)
}
"""

import subprocess

# pass a file handle to input.fq directly on awk's stdin,
# and close it once awk has finished
with open('input.fq', 'r') as input_file:
  p2 = subprocess.Popen(
    ['awk', awk_command],
    stdin=input_file,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
  out1, err = p2.communicate()
Charles Duffy
  • Thank you very much. This makes more sense. But there is a reason why I'm using `cat`. Before executing the command I check whether the file is compressed with gzip; if so, I open the file using zcat, and otherwise using cat. Can I open a gzipped file using open? – cucurbit May 16 '17 at 18:26
  • @cucurbit, for what I'd do, see https://gist.github.com/charles-dyfis-net/14ec07896f3e899315c367420c54afd1. That way you're passing a direct file handle where possible, or piping from `gunzip` where not. – Charles Duffy May 16 '17 at 19:45
  • Thank you! If you remove the quotes from input_filename: `input_file = open('input_filename', 'r')` it works perfectly well. – cucurbit May 18 '17 at 12:21
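
The approach described in these comments (a direct file handle where possible, a pipe from `gunzip` where not) could be sketched roughly as follows. This is not the content of the linked gist. The helper names `is_gzipped` and `read_length_stats` are hypothetical, the gzip check by magic bytes is an assumption about how the file type is detected, and `awk` and `gunzip` are assumed to be on the PATH. The awk program uses `^`, the POSIX spelling of exponentiation, in place of `**`.

```python
import subprocess

# Same statistics as the question's one-liner: mean and standard
# deviation of the sequence-line lengths (every 4th line, offset 2).
AWK_PROGRAM = r"""
NR % 4 == 2 { n = length($0); sum += n; sumsq += n * n; nr++ }
END { mean = sum / nr; printf "%.1f\t%.1f\n", mean, sqrt(sumsq / nr - mean ^ 2) }
"""

def is_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"

def read_length_stats(path):
    """Run the awk program over a plain or gzip-compressed FASTQ file."""
    if is_gzipped(path):
        # No shell involved: gunzip's stdout is wired to awk's stdin.
        gunzip = subprocess.Popen(["gunzip", "-c", path],
                                  stdout=subprocess.PIPE)
        source = gunzip.stdout
    else:
        gunzip = None
        source = open(path, "rb")
    with source:
        # awk inherits the descriptor; our copy is closed on leaving the block.
        awk = subprocess.Popen(["awk", AWK_PROGRAM],
                               stdin=source, stdout=subprocess.PIPE)
    out, _ = awk.communicate()
    if gunzip is not None:
        gunzip.wait()
    return out.decode()
```

Calling `read_length_stats('input.fq')` would return the tab-separated mean and standard deviation as a string, whether or not the file is compressed.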

By default, Python doesn't use the shell to run commands, but pipes are interpreted by the shell. You need to pass shell=True, and with shell=True the command should be a single string rather than a list:

p2 = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
kirbyfan64sos

You can use the commands module to achieve this:

import commands
awk_comm = r"""'NR%4==2{sum+=length($0);nr++;sumsq+=length($0)*length($0)}END{printf"%.1f\t%.1f\n",sum/nr,sqrt(sumsq/nr-(sum/nr)**2)}'"""
p1 = commands.getoutput('cat input.fq | awk ' + awk_comm)
print p1

Hope this helps

Arnab
  • Using string concatenation to form shell commands is innately prone to shell injection attacks. It's not exploitable here, because all the strings are hardcoded, but if you wanted to let the user specify the name of `input.fq`, the naive approach would allow arbitrary commands to be run by embedding them in that name. – Charles Duffy May 16 '17 at 15:44
  • Moreover, `commands` is explicitly deprecated in favor of `subprocess`. You'll note that `commands` doesn't even exist at all in Python 3; and the docs for `subprocess` describe [how to use it in place of `commands`](https://docs.python.org/3/library/subprocess.html#legacy-shell-invocation-functions). – Charles Duffy May 16 '17 at 15:46
  • Thanks for the insight @CharlesDuffy – Arnab May 16 '17 at 15:50
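
The injection risk raised in these comments can be illustrated with a small sketch. The hostile file name is invented for the example, and `printf` stands in for the real pipeline so that nothing harmful runs. `shlex.quote` is one way to make a string safe if it must be interpolated into a shell command; passing an argument list without `shell=True` avoids the problem altogether.

```python
import shlex
import subprocess

# A made-up file name that smuggles in a second command.
hostile_name = "input.fq; echo INJECTED"

# Naive concatenation: the shell treats ';' as a command separator.
naive_cmd = "printf '%s\\n' " + hostile_name
naive_out = subprocess.run(
    naive_cmd, shell=True, capture_output=True, text=True).stdout

# Quoted: the whole name reaches printf as a single argument.
quoted_cmd = "printf '%s\\n' " + shlex.quote(hostile_name)
quoted_out = subprocess.run(
    quoted_cmd, shell=True, capture_output=True, text=True).stdout
```

In the naive case the output contains a separate `INJECTED` line, showing that the embedded command ran. In the quoted case the name is printed back unchanged.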