Does python's subprocess.PIPE store all data sent to stdout into a buffer until it is read from and is there a way to stream?

Question

import subprocess
from pathlib import Path

def check_file_wc_count(path: Path, regex: str):
  try:
    zgrep = subprocess.run(['zgrep', regex, path], check=True, stdout=subprocess.PIPE)
  except subprocess.CalledProcessError as e:
    return 0
  output = subprocess.run(['wc', '-l'], input=zgrep.stdout, capture_output=True, check=True)
  return int(output.stdout.decode('utf-8').strip())

When reading large files (which are gzipped, hence the zgrep), I observe large memory usage, something that (I think) does not normally occur when using the Linux utilities on their own. My guess is that it comes from how I am using subprocess.PIPE: it stores the whole stdout of the zgrep call in a buffer until it is passed as the input of the wc call.

Is this assumption correct, and is there a way to avoid this in Python?

You can use the `pipesize` keyword argument to create a smaller buffer in Linux. This requires Python 3.10 or later. — chepner, Oct 25 '22 at 21:09
consider: https://stackoverflow.com/questions/803265/getting-realtime-output-using-subprocess — juanpa.arrivillaga, Oct 25 '22 at 21:10
That said, you probably don't need a pipe if all you are doing is using `wc -l` to count how many matches `zgrep` found; you can simply use the `-c` option to `zgrep` to get a count instead of the actual matches. (`zgrep ... | wc -l` and `zgrep -c ...` are roughly equivalent.) — chepner, Oct 25 '22 at 21:10
IOW, don't use `subprocess.run`, use the `subprocess.Popen` constructor — juanpa.arrivillaga, Oct 25 '22 at 21:13
`output = subprocess.run(f"zgrep {regex} {path} | wc -l", check=True, shell=True, stdout=subprocess.PIPE)` — Omer Dagry, Oct 25 '22 at 21:21
I'm also doing a mapping from path to the wc count, is there a way to output the filename in one command with the `-c` flag? e.g. `find . -type f -name "{file_regex}" -exec zgrep -c "{match_regex}" {} \;` — ajoseps, Oct 26 '22 at 13:19

ajoseps · Answer 1 · 2022-10-26T13:22:43.727

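It does seem that using subprocess.PIPE the way I do in the posted example keeps the stdout in memory: subprocess.run calls communicate() internally, so it reads the entire output of zgrep before returning, and only then is it handed to wc as input. To avoid this, the solutions from @chepner and @Omer Dagry in the comments seem to work. I think using zgrep -c {regex} {path} is the most straightforward solution: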

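import subprocess
from pathlib import Path

def check_num_quotes(path: Path, regex: str):
  try:
    # zgrep -c prints only the match count, so there is very little output to hold in memory.
    output = subprocess.run(['zgrep', '-c', regex, path], capture_output=True, check=True)
  except subprocess.CalledProcessError:
    # zgrep exits non-zero when nothing matches (or on error); treat that as 0.
    return 0
  return int(output.stdout.decode('utf-8').strip())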
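If the pipe itself is needed (for example when the second command is something other than a line count), the two processes can be connected directly with subprocess.Popen, as suggested in the comments, so the data goes through an OS pipe instead of being collected in Python. A minimal sketch, assuming `zgrep` and `wc` are on the PATH; the helper name `count_lines_piped` is made up for illustration:

```python
import subprocess
from pathlib import Path


def count_lines_piped(producer_cmd, consumer_cmd=("wc", "-l")) -> int:
    """Run `producer | consumer` and return the consumer's output as an int."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    # The consumer reads straight from the producer's pipe; Python never
    # holds the producer's output in memory.
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy of the read end so the producer can receive SIGPIPE
    # if the consumer exits early.
    producer.stdout.close()
    out, _ = consumer.communicate()
    # The producer's exit status is deliberately ignored here: zgrep exits
    # non-zero when nothing matches, and the consumer then simply counts 0.
    producer.wait()
    return int(out.decode("utf-8").strip())


def check_file_wc_count(path: Path, regex: str) -> int:
    # Same name as in the question, but streaming instead of buffering.
    return count_lines_piped(["zgrep", regex, str(path)])
```

Only the final count passes through `communicate()`, so memory use in the Python process should stay small regardless of how many lines match.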

EDIT: My full use case is to search a directory for files whose names match a pattern, count the lines matching a regex in each of those files, and report each count together with its file name. This can be done in a single command:

find . -type f -name "{file_regex}" -exec zgrep -cH "{match_regex}" {} \;
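To get the path-to-count mapping back in Python, the command above could be run once and its output parsed. This is only a sketch: it assumes each output line has the form `path:count`, and the names `parse_counts`, `count_matches_per_file` and `file_glob` are invented here (note that `find -name` takes a shell glob, not a regex).

```python
import subprocess


def parse_counts(output: str) -> dict[str, int]:
    """Turn lines of the assumed form 'path:count' into a {path: count} dict."""
    counts = {}
    for line in output.splitlines():
        # Split on the last colon so paths that contain ':' still parse.
        path, sep, count = line.rpartition(":")
        if sep and count.strip().isdigit():
            counts[path] = int(count)
    return counts


def count_matches_per_file(root: str, file_glob: str, match_regex: str) -> dict[str, int]:
    cmd = ["find", root, "-type", "f", "-name", file_glob,
           "-exec", "zgrep", "-cH", match_regex, "{}", ";"]
    # The output is one short line per file, so capturing it is cheap.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_counts(result.stdout)
```

Since the arguments are passed as a list without `shell=True`, the `{}` and `;` need no quoting or escaping.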
