5

When I run the following code

from subprocess import call, check_output, Popen, PIPE

gr = Popen(["grep", "'^>'", myfile], stdout=PIPE)
sd = Popen(["sed", "s/.*len=//"], stdin=gr.stdout)
gr.stdout.close()
out = sd.communicate()[0]
print out

Where myfile looks like this:

>name len=345
sometexthere
>name2 len=4523
someothertexthere
...
...

I get

None

When the expected output is a list of numbers:

345
4523
...
...

The corresponding command I run in the terminal is

grep "^>" myfile | sed "s/.*len=//" > outfile

So far, I have tried playing around with escaping and quoting in different ways, such as escaping slashes in the sed or adding extra quotation marks for grep, but the combinatorial possibilities there are large.

I have also considered reading the file and writing Python equivalents of grep and sed, but the file is very large (though I could always read it line by line), it will always run on UNIX-based systems, and I am still curious about where I made errors.

Could it be that

sd.communicate()[0]

returns some kind of object (instead of the list of integers) for which None is the type?

I know I can grab the output with check_output in simple cases:

sam = check_output(["samn", "stats", myfile])

but I am not sure how to make it work in more complicated situations where output is being piped between processes.

What are some productive approaches to get the expected results with subprocess?

EKarl

4 Answers

4
  1. Don't put single quotes around ^> in the grep line. This isn't bash so all arguments will be passed to the underlying program literally.
  2. You need to redirect sd's stdout to PIPE.
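
To see the first point in isolation, here is a small sketch (written for Python 3) in which a child Python process echoes back the argument it receives. The `echo_arg` helper command is made up for this illustration; it only stands in for grep to show what the program actually sees.

```python
import subprocess
import sys

# Hypothetical helper command: a child process that prints its first argument.
echo_arg = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

# No shell is involved, so the single quotes are part of the argument itself.
quoted = subprocess.check_output(echo_arg + ["'^>'"]).decode().strip()
plain = subprocess.check_output(echo_arg + ["^>"]).decode().strip()

print(repr(quoted))  # the child received the quotes as literal characters
print(repr(plain))   # the child received only ^>
```

With the quotes included, grep would look for lines that begin with a literal `'`, which is why nothing matched.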
Steven
4

As suggested, you need to set stdout=PIPE in the second process and remove the single quotes from "'^>'":

gr = Popen(["grep", "^>", myfile], stdout=PIPE)
sd = Popen(["sed", "s/.*len=//"], stdin=gr.stdout, stdout=PIPE)
......
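
For reference, a complete version of the two-process pipeline might look like the sketch below. It assumes grep and sed are available on the PATH and is written for Python 3 (hence the `.decode()`). The function name `header_lengths` and the temporary sample file are made up for illustration.

```python
import os
import tempfile
from subprocess import Popen, PIPE


def header_lengths(path):
    """Roughly: grep "^>" path | sed "s/.*len=//", returned as a list of strings."""
    gr = Popen(["grep", "^>", path], stdout=PIPE)
    sd = Popen(["sed", "s/.*len=//"], stdin=gr.stdout, stdout=PIPE)
    gr.stdout.close()  # drop the parent's copy so grep can get SIGPIPE if sed exits
    out = sd.communicate()[0]
    gr.wait()  # reap the grep process
    return out.decode().split()


# Hypothetical sample input mirroring the file from the question.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(">name len=345\nsometexthere\n>name2 len=4523\nsomeothertexthere\n")
    sample_path = tmp.name

lengths = header_lengths(sample_path)
os.remove(sample_path)
print(lengths)
```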

But this can be done simply using pure python and re:

import re
r = re.compile(r"^>.*len=(.*)$")
with open("test.txt") as f:
    for line in f:
        m = r.search(line)
        if m:
            print(m.group(1))

Which would output:

345
4523

If the lines that start with > always have the number and the number is always at the end after len= then you don't actually need a regex either:

with open("test.txt") as f:
    for line in f:
        if line.startswith(">"):
            # rstrip() drops the trailing newline so print does not emit blank lines
            print(line.rsplit("len=", 1)[1].rstrip())
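
If a list of integers is wanted rather than printed text, the same idea can be wrapped in a generator. This is only a sketch: the name `read_lengths` and the in-memory `sample` lines are made up for illustration, and it assumes every header line ends in `len=<number>`.

```python
def read_lengths(lines):
    """Yield the integer after 'len=' for each line that starts with '>'."""
    for line in lines:
        if line.startswith(">") and "len=" in line:
            # int() ignores surrounding whitespace, including the newline
            yield int(line.rsplit("len=", 1)[1])


# Hypothetical sample standing in for an open file object.
sample = [
    ">name len=345\n",
    "sometexthere\n",
    ">name2 len=4523\n",
    "someothertexthere\n",
]

lengths = list(read_lengths(sample))
print(lengths)
```

Because the generator consumes one line at a time, passing it an open file object keeps memory use flat even for a very large file.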
Padraic Cunningham
  • don't use `check_output()` here: it may hang `grep` process if `sed` dies prematurely (until gc closes `gr.stdout` pipe in the parent). To avoid `.close()` call, start in reverse -- see [How do I use subprocess.Popen to connect multiple processes by pipes?](http://stackoverflow.com/a/9164238/4279) – jfs Dec 25 '15 at 03:09
  • @J.F.Sebastian, I just removed it as there is no need for a subprocess call at all, also what is `gc`? – Padraic Cunningham Dec 25 '15 at 11:45
  • Yes, there is no need for a subprocess here. gc is garbage collection. – jfs Dec 25 '15 at 15:28
2

You need to redirect stdout on your second Popen call or the output will just go to the parent process stdout and communicate will return None.

sd = Popen(["sed", "s/.*len=//"], stdin=gr.stdout, stdout=PIPE)
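
The difference is easy to check with a throwaway child process. The sketch below is written for Python 3, and the `child` command is made up for illustration; it just prints one word.

```python
import sys
from subprocess import Popen, PIPE

# Hypothetical child command that writes a single line to its stdout.
child = [sys.executable, "-c", "print('hello')"]

# Without stdout=PIPE the output goes straight to the parent's stdout,
# and communicate() has nothing to hand back.
no_pipe = Popen(child).communicate()

# With stdout=PIPE the output is captured and returned as bytes.
with_pipe = Popen(child, stdout=PIPE).communicate()

print(no_pipe)       # both entries are None
print(with_pipe[0])  # the captured bytes
```

So the None in the question is simply the value of an uncaptured stream, not a special object type.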
tdelaney
1

Padraic Cunningham's answer is acceptable.

If you want to keep the single quotes as you would write them on the command line, use shlex to split the string:

import shlex
from subprocess import call, check_output, Popen, PIPE
gr = Popen(shlex.split("grep '^>' my_file"), stdout=PIPE)
sd = Popen(["sed", "s/.*len=//"], stdin=gr.stdout, stdout=PIPE)
gr.stdout.close()
out = sd.communicate()[0]
print out
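
To see why this works, it helps to look at what `shlex.split` produces for that string. A minimal sketch:

```python
import shlex

# shlex.split applies shell-style quoting rules, so the single quotes
# group the pattern and are then removed from the resulting argument.
args = shlex.split("grep '^>' my_file")
print(args)  # ['grep', '^>', 'my_file']
```

The resulting list is the same one you would write by hand without the inner quotes, so grep receives the bare pattern.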
repzero