import subprocess
from pathlib import Path

def check_file_wc_count(path: Path, regex: str):
  try:
    zgrep = subprocess.run(['zgrep', regex, path], check=True, stdout=subprocess.PIPE)
  except subprocess.CalledProcessError as e:
    return 0
  output = subprocess.run(['wc', '-l'], input=zgrep.stdout, capture_output=True, check=True)
  return int(output.stdout.decode('utf-8').strip())

When reading large files (which are gzipped, hence the zgrep), I observe high memory usage, something that (I think) does not normally occur when using the Linux utilities on their own. I am guessing this is because of how I am using subprocess.PIPE: it stores the stdout of the zgrep call in a buffer until it is passed as the input of the wc call.

Is this assumption correct, and is there a way to avoid this in Python?

ajoseps
  • You can use the `pipesize` keyword argument to create a smaller buffer in Linux. This requires Python 3.10 or later. – chepner Oct 25 '22 at 21:09
  • consider: https://stackoverflow.com/questions/803265/getting-realtime-output-using-subprocess – juanpa.arrivillaga Oct 25 '22 at 21:10
  • That said, you probably don't need a pipe if all you are doing is using `wc -l` to count how many matches `zgrep` found; you can simply use the `-c` option to `zgrep` to get a count instead of the actual matches. (`zgrep ... | wc -l` and `zgrep -c ...` are roughly equivalent.) – chepner Oct 25 '22 at 21:10
  • IOW, don't use `subprocess.run`, use the `subprocess.Popen` constructor – juanpa.arrivillaga Oct 25 '22 at 21:13
  • did you try doing it in one subprocess call instead of 2? – Omer Dagry Oct 25 '22 at 21:20
  • `output = subprocess.run(f"zgrep {regex} {path} | wc -l", check=True, shell=True, stdout=subprocess.PIPE)` – Omer Dagry Oct 25 '22 at 21:21
  • I'm also doing a mapping from path to the wc count, is there a way to output the filename in one command with the `-c` flag? e.g. `find . -type f -name "{file_regex}" -exec zgrep -c "{match_regex}" {} \;` – ajoseps Oct 26 '22 at 13:19

1 Answer


It does seem that using subprocess.PIPE the way I do in the posted example stores the entire stdout of zgrep in an in-memory buffer. To avoid this, the solutions from @chepner and @Omer Dagry in the comments seem to work. I think using zgrep -c {regex} {path} is the most straightforward solution:

def check_num_quotes(path: Path, regex: str):
  try:
    # -c makes zgrep print only the number of matching lines
    output = subprocess.run(['zgrep', '-c', regex, path], capture_output=True, check=True)
  except subprocess.CalledProcessError:
    # zgrep exits non-zero when nothing matches (and also on errors)
    return 0
  return int(output.stdout.decode('utf-8').strip())
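If the pipe itself is needed (for example, a different consumer than wc), the subprocess.Popen approach @juanpa.arrivillaga suggested in the comments can be sketched roughly as below. This is only a sketch under assumptions: the names count_lines_streaming and check_file_wc_count_streaming are made up for illustration, and it assumes wc and zgrep are on the PATH. The idea is that zgrep's stdout is connected directly to wc's stdin at the OS level, so Python never holds the matching lines in memory.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def count_lines_streaming(producer_cmd: list[str]) -> int:
    """Pipe the output of producer_cmd into `wc -l` (hypothetical helper)."""
    # The producer writes into an OS pipe; Python does not read from it.
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    # wc reads directly from the producer's end of the pipe.
    consumer = subprocess.Popen(
        ["wc", "-l"], stdin=producer.stdout, stdout=subprocess.PIPE
    )
    # Close our copy of the pipe so the producer sees SIGPIPE if wc exits early.
    producer.stdout.close()
    out, _ = consumer.communicate()  # only the short count is read into Python
    producer.wait()
    return int(out.decode("utf-8").strip())


def check_file_wc_count_streaming(path: Path, regex: str) -> int:
    """Streaming variant of the function from the question (hypothetical name)."""
    return count_lines_streaming(["zgrep", regex, str(path)])
```

Unlike the version in the question, this does not treat a non-zero zgrep exit status specially; when nothing matches, wc simply reports 0.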

EDIT: My full use case is to search through a directory, find all files whose names match, count the lines matching a regex in each of those files, and report which file each count belongs to. This can be done in a single command:

find . -type f -name "{file_regex}" -exec zgrep -cH "{match_regex}" {} \;
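To get the path-to-count mapping back into Python, something along these lines might work. It is a sketch, not a tested part of my setup: parse_count_output and count_matches_in_tree are hypothetical names, and it assumes each output line has the form path:count, which is what the -cH flags appear to produce. Only the short count lines are captured, so memory use should stay small.

```python
from __future__ import annotations

import subprocess


def parse_count_output(output: str) -> dict[str, int]:
    """Turn lines of the form 'path:count' into a {path: count} mapping."""
    counts: dict[str, int] = {}
    for line in output.splitlines():
        if not line:
            continue
        # Split on the last colon, since the path itself may contain colons.
        name, _, count = line.rpartition(":")
        counts[name] = int(count)
    return counts


def count_matches_in_tree(root: str, file_pattern: str, match_regex: str) -> dict[str, int]:
    """Run the find/zgrep command above and parse its output (hypothetical helper)."""
    result = subprocess.run(
        ["find", root, "-type", "f", "-name", file_pattern,
         "-exec", "zgrep", "-cH", match_regex, "{}", ";"],
        capture_output=True, check=True,
    )
    return parse_count_output(result.stdout.decode("utf-8"))
```

No shell is involved here, so the `;` that terminates -exec does not need escaping.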
ajoseps