75

How do I execute the following shell command using the Python subprocess module?

echo "input data" | awk -f script.awk | sort > outfile.txt

The input data will come from a string, so I don't actually need echo. I've got this far; can anyone explain how I get it to pipe through sort too?

import subprocess

p_awk = subprocess.Popen(["awk", "-f", "script.awk"],
                         stdin=subprocess.PIPE,
                         stdout=open("outfile.txt", "w"))
p_awk.communicate(b"input data")

UPDATE: Note that while the accepted answer below doesn't actually answer the question as asked, I believe S.Lott is right and it's better to avoid having to solve that problem in the first place!

twasbrillig
Tom

9 Answers

54

You'd be a little happier with the following.

import subprocess

awk_sort = subprocess.Popen( "awk -f script.awk | sort > outfile.txt",
    stdin=subprocess.PIPE, shell=True )
awk_sort.communicate( b"input data\n" )

Delegate part of the work to the shell. Let it connect two processes with a pipeline.

You'd be a lot happier rewriting 'script.awk' into Python, eliminating awk and the pipeline.

Edit. Some of the reasons for suggesting that awk isn't helping.

[There are too many reasons to respond via comments.]

  1. Awk is adding a step of no significant value. There's nothing unique about awk's processing that Python doesn't handle.

  2. The pipelining from awk to sort, for large sets of data, may improve elapsed processing time. For short sets of data, it has no significant benefit. A quick measurement of `awk >file ; sort file` against `awk | sort` will reveal whether concurrency helps. With sort, it rarely helps because sort is not a once-through filter.

  3. The simplicity of "Python to sort" processing (instead of "Python to awk to sort") prevents the exact kind of questions being asked here.

  4. Python -- while wordier than awk -- is also explicit where awk has certain implicit rules that are opaque to newbies, and confusing to non-specialists.

  5. Awk (like the shell script itself) adds Yet Another Programming language. If all of this can be done in one language (Python), eliminating the shell and the awk programming eliminates two programming languages, allowing someone to focus on the value-producing parts of the task.

Bottom line: awk can't add significant value. In this case, awk is a net cost; it added enough complexity that it was necessary to ask this question. Removing awk will be a net gain.
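
For illustration, here is a minimal pure-Python sketch of the whole job. The extract() function is a hypothetical stand-in for whatever script.awk actually does:

def extract(line):
    # stand-in for the awk script's per-line processing
    fields = line.split()
    return fields[1] if len(fields) > 1 else line

input_data = "input data\n"

results = [extract(line) for line in input_data.splitlines()]
with open("outfile.txt", "w") as out:
    out.write("\n".join(sorted(results)) + "\n")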

Sidebar: Why building a pipeline (a | b) is so hard.

When the shell is confronted with a | b it has to do the following.

  1. Fork a child process of the original shell. This will eventually become b.

  2. Build an OS pipe (not a Python subprocess.PIPE): call os.pipe(), which returns two new file descriptors that are connected via a common buffer. At this point the process has stdin, stdout and stderr from its parent, plus a file that will be "a's stdout" and "b's stdin".

  3. Fork a child. The child replaces its stdout with the new a's stdout. Exec the a process.

  4. The b child replaces its stdin with the new b's stdin. Exec the b process.

  5. The b child waits for a to complete.

  6. The parent is waiting for b to complete.

I think that the above can be used recursively to spawn a | b | c, but you have to implicitly parenthesize long pipelines, treating them as if they're a | (b | c).

Since Python has os.pipe(), os.fork() and the os.exec*() family, and you can replace sys.stdin and sys.stdout, there's a way to do the above in pure Python (a sketch follows below). Indeed, you may be able to work out some shortcuts using os.pipe() and subprocess.Popen.

However, it's easier to delegate that operation to the shell.
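
To make the sequence concrete, here is a minimal sketch of wiring a | b by hand, using ls and sort as stand-in commands (POSIX only; error handling omitted):

import os

r, w = os.pipe()                 # two fds connected through a common buffer

a_pid = os.fork()
if a_pid == 0:                   # child "a"
    os.dup2(w, 1)                # replace stdout with the pipe's write end
    os.close(r)
    os.close(w)
    os.execvp("ls", ["ls"])      # becomes process a

b_pid = os.fork()
if b_pid == 0:                   # child "b"
    os.dup2(r, 0)                # replace stdin with the pipe's read end
    os.close(r)
    os.close(w)
    os.execvp("sort", ["sort"])  # becomes process b

os.close(r)                      # the parent must close both ends,
os.close(w)                      # or b never sees EOF
os.waitpid(a_pid, 0)
os.waitpid(b_pid, 0)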

Adam Spiers
S.Lott
  • Can you explain what the "-c" does? – Tom Nov 17 '08 at 23:37
  • 4
    And I think Awk is actually a good fit for what I am doing, the code is shorter and simpler than the equivalent Python code (it's a domain specific language after all.) – Tom Nov 17 '08 at 23:40
  • 1
    -c tells the shell (the actual application your starting) that the following argument is a command to run. In this case, the command is a shell pipeline. – S.Lott Nov 18 '08 at 00:26
  • 6
    "the code is shorter" does not -- actually -- mean simpler. It only means shorter. Awk has a lot of assumptions and hidden features that make the code very hard to work with. Python, while longer, is explicit. – S.Lott Nov 18 '08 at 00:27
  • Ok, thanks, my code is working now, but I'm not going mark it as the accepted answer yet as I think there must be a way to do this without punting to the shell and having to deal with the escaping issues etc. that that raises. And I replaced "> outfile.txt" with stdout=file("outfile.txt","w"). – Tom Nov 18 '08 at 00:33
  • 1
    Sure, I understand your points & concerns and agree that in many cases my example above would better written in pure Python. I'm not ready to do that in my case yet however as the awk script works and is debugged. Sooner or later, but not right now. – Tom Nov 18 '08 at 00:57
  • 1
    And, that doesn't change the original question, which is how to use subprocess.Popen. Awk and sort are only used for illustration as potential answerers are likely to have them to test with. – Tom Nov 18 '08 at 00:58
  • The original question (how to assemble a shell pipeline with Popen) is something that's (a) complex and (b) never necessary. Using the shell or eliminating the complexity are better approaches. – S.Lott Nov 18 '08 at 01:06
  • Ok, you might have convinced me :) I am still going to keep using Awk for now as rewriting in Python isn't feasible in the short term, however I can see it's simpler/more reasonable to let the shell handle the pipelining. – Tom Nov 18 '08 at 01:49
  • 1
    This answer contains useful info, but I give it -1 because there is a better answer, which answers the question using subprocess.Popen, not the shell, to create pipelines. It can be more difficult to escape commands correctly for the shell, than to assemble a pipeline in python. The excellent "Not tested" answer below shows that it's not so hard to do this directly in python. Your answer is good for the specific question, but not so good for the general problem of creating pipelines in python. – Sam Watkins Mar 14 '13 at 04:50
  • @SamWatkins: you could consider a shell as a DSL designed specifically to be efficient at one-liners that start many processes. Like regex is a DSL for search/replace in a text, Markdown -- for specifying formatting as plain text, etc. DSL is not a general purpose programming language (like Python); you can easily reach its limitations e.g., [`pipes.quote()`](https://docs.python.org/3/library/shlex.html#shlex.quote) may break sometimes. You could try `plumbum` module that embeds the DSL into Python, see [example in my answer](http://stackoverflow.com/a/16709666/4279) – jfs Sep 11 '14 at 18:28
  • The `-c` creates a *second* shell. You're better off with `Popen('pipe|line', shell=True)` rather than the, frankly, rather whimsical, `Popen(['-c', 'pipe|line'], shell=True)`. – tripleee Oct 29 '14 at 13:56
  • @tripleee: it does not start the second shell, it runs `/bin/sh -c -c 'awk ...'` command. – jfs Mar 12 '15 at 22:17
  • There are downsides to delegating to the shell; one example is that you can't handle errors this way: if one of the processes fails, the output will still be processed as if it succeeded. Try this: false | echo hello – Omry Yadan Dec 09 '18 at 00:11
34

import subprocess

some_string = b'input_data'

# sort writes straight to the output file (unbuffered binary)
sort_out = open('outfile.txt', 'wb', 0)
sort_in = subprocess.Popen('sort', stdin=subprocess.PIPE, stdout=sort_out).stdin
# awk's stdout is wired to sort's stdin; communicate() feeds awk the input bytes
subprocess.Popen(['awk', '-f', 'script.awk'], stdout=sort_in,
                 stdin=subprocess.PIPE).communicate(some_string)
jfs
Cristian
  • excellent! I modified it to make a self-contained example without the awk script, it uses sed: http://sam.nipl.net/code/python/pipeline.py – Sam Watkins Mar 14 '13 at 04:41
  • 2
    @SamWatkins: you don't need `p1.wait()` in your code. `p1.communicate()` reaps the child process. – jfs Sep 11 '14 at 18:19
  • 7
    Isn't this answer more pythonic and better? It doesn't use `shell=True` as discouraged in the subprocess documentation. I can't see the reason why people up-voted @S.Lott answer. – Ken T Jun 05 '15 at 04:22
  • 3
    @KenT: the shell solution is more readable and less error-prone (if you don't accept untrusted input). The pythonic solution would [use `plumbum` (the shell syntax embedded in Python)](http://stackoverflow.com/a/16709666/4279) or another module that accepts a similar syntax (in a string) and constructs the pipeline for you (same behavior whatever local `/bin/sh` does). – jfs Jul 28 '15 at 20:02
21

To emulate a shell pipeline:

from subprocess import check_call

check_call('echo "input data" | a | b > outfile.txt', shell=True)

To do the same without invoking the shell (see 17.1.4.2. Replacing shell pipeline):

#!/usr/bin/env python
from subprocess import Popen, PIPE

a = Popen(["a"], stdin=PIPE, stdout=PIPE)
with a.stdin:
    with a.stdout, open("outfile.txt", "wb") as outfile:
        b = Popen(["b"], stdin=a.stdout, stdout=outfile)
    a.stdin.write(b"input data")
statuses = [a.wait(), b.wait()] # both a.stdin/stdout are closed already

plumbum provides some syntax sugar:

#!/usr/bin/env python
from plumbum.cmd import a, b # magic

(a << "input data" | b > "outfile.txt")()

The analog of:

#!/bin/sh
echo "input data" | awk -f script.awk | sort > outfile.txt

is:

#!/usr/bin/env python
from plumbum.cmd import awk, sort

(awk["-f", "script.awk"] << "input data" | sort > "outfile.txt")()
jfs
  • Plumbum looks very nice, but I'm wary of "magic." This isn't Perl! – Kyle Strand Dec 04 '14 at 17:52
  • Plumbum does look nice! I wouldn't worry about the magic @KyleStrand - from a quick peek at the docs, you're not required to use the "magic" bits, the module has other ways of doing the same thing - and a quick look at the code shows that the magic is harmless and actually quite slick, not nasty at all. – Tom Dec 06 '14 at 09:13
  • @Tom I don't know, that's a lot of operator overloading with potentially surprising meanings. Part of me loves it, but I'd be reluctant to use it anywhere but in a personal project. – Kyle Strand Dec 08 '14 at 04:48
  • 1
    @KyleStrand: In general I would agree with you but in practice it is much more likely that people either construct the command line incorrrectly (e.g., by forgetting `pipes.quote()`) or introduce bugs while implementing the pipeline in Python, [even `a | b` could be implemented with errors](http://stackoverflow.com/q/28995260/4279). – jfs Mar 12 '15 at 22:22
  • @jfs, How do I read the file if the file is coming via a POST request, using cat or the << operator? – CKM Nov 06 '17 at 11:53
  • @chandresh: if you have a new question, ask a separate Stack Overflow question [ask] – jfs Nov 06 '17 at 13:13
13

The accepted answer sidesteps the actual question. Here is a snippet that chains the output of multiple processes. Note that it also prints the (somewhat) equivalent shell command so you can run it and make sure the output is correct.

#!/usr/bin/env python3

from subprocess import Popen, PIPE

# cmd1 : dd if=/dev/zero bs=1M count=100
# cmd2 : tee
# cmd3 : wc -c
cmd1 = ['dd', 'if=/dev/zero', 'bs=1M', 'count=100']
cmd2 = ['tee']
cmd3 = ['wc', '-c']
print(f"Shell style : {' '.join(cmd1)} | {' '.join(cmd2)} | {' '.join(cmd3)}")

p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE) # stderr=PIPE optional, dd is chatty
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE)
p3 = Popen(cmd3, stdin=p2.stdout, stdout=PIPE)

print("Output from last process : " + (p3.communicate()[0]).decode())

# theoretically p1 and p2 may still be running; this ensures we are collecting their return codes
p1.wait()
p2.wait()
print("p1 return: ", p1.returncode)
print("p2 return: ", p2.returncode)
print("p3 return: ", p3.returncode)
Omry Yadan
  • If `p*.returncode` returns 0, could I assume that there is no error generated? @Omry Yadan – alper Sep 25 '19 at 14:12
  • you can be sure it returned 0. "error generated" is not well defined. it can still print things to stderr. – Omry Yadan Sep 26 '19 at 19:20
  • So as I understand, I have to check stderr as well; if it is an empty string I can be sure that no error was generated. – alper Sep 26 '19 at 19:24
  • 1
    It depends on what you mean by an error. some programs would print to stderr routinely even if there is no error. – Omry Yadan Sep 28 '19 at 05:40
  • Can this deadlock? `cmd1` is writing to `stderr` and nothing is consuming `p1.stderr`. If the file buffer fills up, the OS will stop executing `p1` process. Same for `p2`. – Foldager Oct 02 '20 at 18:35
  • Yes. this seems plausible. someone needs to consume on the other side otherwise things will stall once the buffer fills up. – Omry Yadan Oct 03 '20 at 19:46
  • You could redirect stderr to DEVNULL to avoid the issue pointed out by @Foldager, I guess – Patrizio Bertoni Apr 07 '22 at 13:56
2

Inspired by @Cristian's answer. I ran into just the same issue, but with a different command, so I'm putting up my tested example, which I believe could be helpful:

grep_proc = subprocess.Popen(["grep", "rabbitmq"],
                             stdin=subprocess.PIPE, 
                             stdout=subprocess.PIPE)
subprocess.Popen(["ps", "aux"], stdout=grep_proc.stdin)
out, err = grep_proc.communicate()

This is tested.

What has been done

  • Declared the lazy grep execution with stdin from a pipe. This command will be executed when the ps command runs and fills the pipe with its stdout.
  • Called the primary command ps with stdout directed to the pipe used by the grep command.
  • Grep communicated to get stdout from the pipe.

I like this way because it is the natural pipe concept gently wrapped with subprocess interfaces.

I159
  • 1
    to avoid zombies, call `ps_proc.wait()` after `grep_proc.communicate()`. `err` is always `None` unless you set `stderr=subprocess.PIPE`. – jfs Feb 11 '15 at 00:45
2

http://www.python.org/doc/2.5.2/lib/node535.html covered this pretty well. Is there some part of this you didn't understand?

Your program would be pretty similar, but the second Popen would have stdout= to a file, and you wouldn't need the output of its .communicate().
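
A minimal sketch of what that might look like (untested; assumes script.awk as in the question):

import subprocess

# sort's stdout goes straight to the file; awk's stdout feeds sort's stdin
with open("outfile.txt", "wb") as outfile:
    p2 = subprocess.Popen(["sort"], stdin=subprocess.PIPE, stdout=outfile)
    p1 = subprocess.Popen(["awk", "-f", "script.awk"],
                          stdin=subprocess.PIPE, stdout=p2.stdin)
    p1.communicate(b"input data\n")  # send the string, close p1's stdin
    p2.stdin.close()                 # let sort see EOF
    p2.wait()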

geocar
  • What I don't understand (given the documentation's example) is if I say p2.communicate("input data"), does that actually get sent to p1.stdin? – Tom Nov 17 '08 at 23:35
  • You wouldn't. p1's stdin arg would be set to PIPE and you'd write p1.communicate('foo') then pick up the results by doing p2.stdout.read() – geocar Nov 18 '08 at 21:30
  • @Leonid - The Python people aren't very good at backwards compatibility. You can get much of the same information from: https://docs.python.org/2/library/subprocess.html#popen-objects but I've replaced the link with a wayback machine link anyway. – geocar Jun 28 '14 at 12:48
  • 2
    There's no need for the snarkiness of asking if there's "some part of [the docs] that [OP] didn't understand". As shown in this question, the part of the docs that you posted doesn't actually address the issue of passing input to the first process: http://stackoverflow.com/q/6341451/1858225 – Kyle Strand Dec 04 '14 at 17:42
2

The previous answers missed an important point. The "Replacing shell pipeline" approach is basically correct, as pointed out by geocar. It is almost sufficient to run communicate on the last element of the pipe.

The remaining problem is passing the input data to the pipeline. With multiple subprocesses, a simple communicate(input_data) on the last element doesn't work - it hangs forever. You need to create a pipeline and a child manually, like this:

import os
import subprocess

input = """\
input data
more input
""" * 10

rd, wr = os.pipe()
if os.fork() != 0: # parent
    os.close(wr)
else:              # child
    os.close(rd)
    os.write(wr, input.encode())  # pipes carry bytes
    os.close(wr)
    os._exit(0)  # leave the forked child without running cleanup handlers

p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                         stdin=rd,
                         stdout=subprocess.PIPE)
p_sort = subprocess.Popen(["sort"], 
                          stdin=p_awk.stdout,
                          stdout=subprocess.PIPE)
p_awk.stdout.close()
out, err = p_sort.communicate()
print(out.decode().rstrip())

Now the child provides the input through the pipe, and the parent calls communicate(), which works as expected. With this approach, you can create arbitrary long pipelines without resorting to "delegating part of the work to the shell". Unfortunately the subprocess documentation doesn't mention this.

There are ways to achieve the same effect without pipes:

from tempfile import TemporaryFile
tf = TemporaryFile()
tf.write(input.encode())  # TemporaryFile defaults to binary mode
tf.seek(0, 0)

Now use stdin=tf for p_awk. It's a matter of taste what you prefer.
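
Spelled out, the first Popen would then look like this (a sketch):

p_awk = subprocess.Popen(["awk", "{ print $2; }"],
                         stdin=tf,
                         stdout=subprocess.PIPE)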

The above is still not 100% equivalent to bash pipelines because the signal handling is different. You can see this if you add another pipe element that truncates the output of sort, e.g. head -n 10. With the code above, sort will print a "Broken pipe" error message to stderr. You won't see this message when you run the same pipeline in the shell. (That's the only difference though, the result in stdout is the same). The reason seems to be that python's Popen sets SIG_IGN for SIGPIPE, whereas the shell leaves it at SIG_DFL, and sort's signal handling is different in these two cases.
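
If you want the shell-like SIGPIPE behavior, one workaround (a sketch, not tested) is to restore the default handler in the child via preexec_fn:

import signal
import subprocess

# restore SIGPIPE's default action so the child dies silently when its
# reader goes away, just as it would under a shell (POSIX only)
p_sort = subprocess.Popen(
    ["sort"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    preexec_fn=lambda: signal.signal(signal.SIGPIPE, signal.SIG_DFL),
)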

uncleremus
1

EDIT: pipes is available on Windows but, crucially, doesn't appear to actually work on Windows. See comments below.

The Python standard library now includes the pipes module for handling this:

https://docs.python.org/2/library/pipes.html, https://docs.python.org/3.4/library/pipes.html

I'm not sure how long this module has been around, but this approach appears to be vastly simpler than mucking about with subprocess.
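
For instance, a sketch of the question's pipeline using pipes.Template (POSIX-only, untested; the '--' flag marks a step that reads stdin and writes stdout):

import pipes

t = pipes.Template()
t.append('awk -f script.awk', '--')  # reads stdin, writes stdout
t.append('sort', '--')
f = t.open('outfile.txt', 'w')       # returns the write end of the pipeline
f.write('input data\n')
f.close()                            # result lands in outfile.txt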

Kyle Strand
  • `pipes` existed even before `subprocess` module. It builds a (\*nix) shell pipeline (a string with `"|"` that is executed in `/bin/sh`). It is not portable. It is not an alternative to `subprocess` module that is portable and does not require to start a shell to run a command. `pipes` interface is from the time when Enterprise JavaBeans were shiny new things (it is *not* a compliment). Could you provide `pipes` code example that is *"vastly simpler"* than subprocess': `check_call('echo "input data" | a | b > outfile.txt', shell=True)` [from my answer](http://stackoverflow.com/a/16709666/4279)? – jfs Dec 04 '14 at 19:20
  • @J.F.Sebastian Huh. Shouldn't your `check_call` command be equally non-portable? DOS provides standard (i.e. *NIX-comparable, at least AFAIK) `|` behavior, so which systems are you expecting `pipes` not to work on? I admit that using `check_call` with a string representing your shell command is arguably just as simple as using `pipes`, but I was hoping for something that would facilitate the programmatic construction of a pipeline rather than just taking a single string to pass to the shell (a la your other examples). – Kyle Strand Dec 04 '14 at 20:56
  • As I said in my comment, `plumbum` does look nice--it appears to provide exactly the simplicity, flexibility, and power that I'm looking for. However, the syntax is entirely opaque and non-Pythonic. So what I want is something that's approximately as simple and easy as standard \*NIX-shell pipes (if perhaps *slightly* less concise) while still syntactically and stylistically "looking" like Python. `pipes`, at first glance, certainly seems to meet these requirements; if, however, you're right that it's non-portable (which you probably are), then of course it's the least attractive option. – Kyle Strand Dec 04 '14 at 21:01
  • ...and, yes, it looks like a simple piping together of `echo hello world` with `C:\cygwin\bin\tr a-z A-Z` fails on Windows, even though `echo hello world | C:\cygwin\bin\tr.exe a-z A-Z` works. That's...strange and disappointing. – Kyle Strand Dec 04 '14 at 21:22
1

For me, the approach below is the cleanest and easiest to read:

from subprocess import Popen, PIPE

def string_to_2_procs_to_file(input_s, first_cmd, second_cmd, output_filename):
    with open(output_filename, 'wb') as out_f:
        p2 = Popen(second_cmd, stdin=PIPE, stdout=out_f)
        p1 = Popen(first_cmd, stdout=p2.stdin, stdin=PIPE)
        p1.communicate(input=input_s.encode())  # bytes(input_s) would fail for str on Python 3
        p1.wait()
        p2.stdin.close()
        p2.wait()

which can be called like so:

string_to_2_procs_to_file('input data', ['awk', '-f', 'script.awk'], ['sort'], 'output.txt')
mwag