As part of a larger Python script related to genome sequence alignment, I would like to record the number of lines in a gzipped file provided by the user. As far as I can tell, the following bash command is the simplest way to do this:
zcat file/path.gz | wc -l
Using Python's subprocess module (and avoiding shell=True), I use the following code to accomplish the same task:
zcatInput = subprocess.check_output(("zcat", filePathFromUser), encoding = "utf-8")
result = subprocess.check_output(("wc", "-l"), encoding = "utf-8", input = zcatInput)
However, some of the files I am trying to do this with are hundreds of millions of lines long once unzipped, so storing the result of the zcat command uses a ridiculous amount of memory (more than my computer has available).
My question is this: What is the best way to circumvent this problem without using shell=True? (From what I've read online, shell=True appears to be very dangerous and should be avoided at all costs).
Additionally: Is this a scenario where the use of shell=True with properly sanitized inputs is acceptable, or do the dangers of shell=True go further than I am aware of?
In my efforts to answer these questions, I've come up with two possible solutions:
- Create a simple bash script that performs the zcat operation on a given file and pipes it to wc -l. Then, call that script using the subprocess module (and shell=False, hooray!).
- Use the shell=True argument with the subprocess module to just run the bash command with '|', but sanitize the user input using shlex.quote().
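To make the second approach concrete, here is a minimal sketch of what I have in mind. The helper names build_count_command and count_lines_with_shell are made up for illustration, and the sketch assumes zcat and wc are available on the PATH; I have not verified it against my largest files.

```python
import shlex
import subprocess


def build_count_command(path):
    # Hypothetical helper: shlex.quote() wraps the user-supplied path in
    # single quotes when it contains characters the shell would interpret,
    # so the path is passed to zcat as one literal argument.
    return "zcat " + shlex.quote(path) + " | wc -l"


def count_lines_with_shell(path):
    # Hypothetical helper: the shell connects zcat to wc with an OS pipe,
    # so only the short output of wc is captured by Python, not the
    # decompressed file contents.
    output = subprocess.check_output(
        build_count_command(path), shell=True, encoding="utf-8"
    )
    return int(output.strip())
```

With this, a call such as count_lines_with_shell(filePathFromUser) would run the same pipeline as the bash command above, and a path like "my file.gz" would be passed to the shell as 'my file.gz'.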
I suppose you could also boil down my question to something more concrete by asking: What is the difference (if any) between the two above approaches in terms of security and memory constraints, and is there a preferable "option 3" that I am missing?