
As part of a larger Python script I am writing for genome sequence alignment, I would like to record the number of lines in a gzipped file provided by the user. As far as I can tell, the following bash command is the simplest way to accomplish such a task:

zcat file/path.gz | wc -l

Using the Python subprocess module (and avoiding shell=True), I use the following code to accomplish the same task:

zcatInput = subprocess.check_output(("zcat", filePathFromUser), encoding="utf-8")
result = subprocess.check_output(("wc", "-l"), encoding="utf-8", input=zcatInput)

However, some of the files I am trying to do this with are hundreds of millions of lines long once unzipped, so storing the result of the zcat command uses a ridiculous amount of memory (more than my computer has available).

My question is this: What is the best way to circumvent this problem without using shell=True? (From what I've read online, shell=True appears to be very dangerous and should be avoided at all costs).
Additionally: Is this a scenario where the use of shell=True with properly sanitized inputs is acceptable, or do the dangers of shell=True go even further than I am aware of?

In my efforts to answer these questions, I've come up with two possible solutions:

  1. Create a simple bash script to perform the zcat operation on a given file and pipe it to wc -l. Then, call that script using the subprocess module (and shell=False, hooray!)
  2. Use the shell=True argument with the subprocess module to just run the bash command with '|', but sanitize the user input using shlex.quote(). (A sketch of this option follows the list.)
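
For concreteness, here is a minimal sketch of what option 2 might look like (filePathFromUser is the user-supplied path; the assignment below is a placeholder):

import shlex
import subprocess

filePathFromUser = "file/path.gz"  # placeholder for the user-supplied path

# shlex.quote() neutralizes shell metacharacters in a hostile filename before
# the string is handed to the shell, which then sets up the pipe itself.
cmd = "zcat " + shlex.quote(filePathFromUser) + " | wc -l"
result = subprocess.check_output(cmd, shell=True, encoding="utf-8")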

I suppose you could also boil down my question to something more concrete by asking: What is the difference (if any) between the two above approaches in terms of security and memory constraints, and is there a preferable "option 3" that I am missing?
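
One candidate for such an "option 3", based on the pipeline recipe in the subprocess documentation, would be to connect the two processes directly so that the decompressed data never passes through Python at all. A minimal sketch (the path assignment is a placeholder for the user-supplied value):

import subprocess

filePathFromUser = "file/path.gz"  # placeholder for the user-supplied path

# zcat writes into an OS pipe that wc reads from; Python only ever sees
# wc's one-line output, so memory use stays constant.
zcat = subprocess.Popen(("zcat", filePathFromUser), stdout=subprocess.PIPE)
wc = subprocess.Popen(("wc", "-l"), stdin=zcat.stdout, stdout=subprocess.PIPE, encoding="utf-8")
zcat.stdout.close()  # lets zcat receive SIGPIPE if wc exits first
lineCount = int(wc.communicate()[0])
zcat.wait()  # reap zcat once the pipeline finishes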

  • Don't use a subprocess at all. Use the `gzip` library to read the file. – Barmar Mar 03 '21 at 04:54
  • If you have to use subprocess, use `stdout=PIPE` instead of returning the output. – Barmar Mar 03 '21 at 04:54
  • @Barmar Thank you for the suggestions. I should have mentioned that I am aware of the gzip library in Python, but since I only want to count the number of lines in the file, I'd like to use the bash command `wc -l` here, since it runs significantly faster. Also, unless I am missing something, the subprocess.check_output function is a convenience function which includes the stdout=PIPE assignment. (Documentation here: https://docs.python.org/3/library/subprocess.html#subprocess.check_output) – b.morledge-hampton Mar 03 '21 at 05:42
  • Read the linked question, it shows how to connect the output of `zcat` directly to the input of `wc -l`. – Barmar Mar 03 '21 at 05:45
  • Whether or not you use `shell=True` here is pretty much beside the point. If you are not using Python's features at all, probably using a shell _instead of_ Python is the way to go. It's not hard to reimplement `wc -l` in native Python so you might just do that instead if you really want a Python script. Briefly, `import gzip; count = 0; with gzip.open(filePathFromUser) as f: for line in f: count += 1` (perhaps it could be more elegantly phrased with `reduce`; this snippet is expanded into a runnable form after these comments) – tripleee Mar 03 '21 at 06:01
  • Thank you for the suggestions. @Barmar the question you linked does not answer my question as their solution uses several orders of magnitude more memory than using shell=True along with the pipe command. – b.morledge-hampton Mar 03 '21 at 06:20
  • @tripleee This is very true. I am probably just being stubborn here and wanting to keep too much functionality within my python script while maintaining the speed of bash commands. I will try to approach the problem from a different angle. Thank you for the assistance. – b.morledge-hampton Mar 03 '21 at 06:20
  • How are you examining memory usage? Python itself should not use much memory at all though the subprocesses probably buffer all they can if there is free memory and the disk is fast. – tripleee Mar 03 '21 at 06:40
  • @tripleee It actually is the Python3.8 process. I can see its memory usage slowly grow in Task Manager (I'm using WSL to run my code). Its memory footprint slowly approaches the 16 GB of RAM I have installed, and then the script fails with: `OSError: [Errno 12] Cannot allocate memory` – b.morledge-hampton Mar 03 '21 at 07:28
  • Sounds like a WSL bug maybe? I ran `tracemalloc` on a much more complex pipeline and I see no memory at all being allocated. (Unfortunately Ideone won't let me run `tracemalloc` but the code is here: https://ideone.com/wXELYK) – tripleee Mar 03 '21 at 07:41
  • @tripleee Very interesting... Thanks so much for the insight. It appears there's definitely more going on here than I know how to diagnose. I may look into it more (I'm sure it'll be a great learning experience), but for the moment I think I'm maybe ready to switch gears to another approach... :P Thanks again for all the help. – b.morledge-hampton Mar 03 '21 at 08:14
  • The approach in the linked question should be effectively equivalent to running the shell command `zcat filename | wc -l` – Barmar Mar 03 '21 at 14:57
  • My apologies. I just realized that I was ignorant of the fundamental differences between subprocess.run and subprocess.Popen. It turns out the former has to complete before its output can be used elsewhere; hence the memory issues. Thank you for all the help, and I'm sorry I didn't realize this earlier. – b.morledge-hampton Mar 03 '21 at 20:07
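
For completeness, tripleee's gzip one-liner from the comments, expanded into a runnable sketch (the path is a placeholder for the user-supplied file):

import gzip

# Stream through the compressed file line by line with Python's own gzip
# module; memory use stays constant because only one line is held at a time.
count = 0
with gzip.open("file/path.gz", "rb") as f:  # placeholder path
    for _ in f:
        count += 1
print(count)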

0 Answers