
I'm trying to count the number of lines that start with ">" using grep and then store that value as a Python integer (`int`). I'm running the command through subprocess using the `call` function.

In [8]: x = subprocess.call(f"grep -c '^>' {path}", shell=True)
4626

In [9]: x
Out[9]: 0

The count is printed to stdout, but I want it stored in the variable x.

O.rka
  • This code has serious security vulnerabilities. Keep in mind that `$(rm -rf ~)` is completely legal as part of a filename; so is `; rm /etc/passwd`. – Charles Duffy Jan 31 '18 at 20:59
  • It'd be much safer to use `subprocess.call(['grep', '-c', '^>', '--', path])`, or, barring that, `subprocess.call(['''grep -c '^>' -- "$1"''', '_', path], shell=True)`. – Charles Duffy Jan 31 '18 at 21:00
  • See the "warning" section in red in the documentation at https://docs.python.org/2/library/subprocess.html#frequently-used-arguments for more discussion. – Charles Duffy Jan 31 '18 at 21:01
  • So much dread looking at this line: `rm -rf ~` – O.rka Jan 31 '18 at 21:05
  • @O.rka: You can also just do this within Python: `with open(filename, 'rb') as handle: total = sum(line.startswith(b'>') for line in handle)`. – Blender Jan 31 '18 at 21:08
  • Thanks @Blender. I think the `bash` way is faster, and I'm trying to speed this up since I'm working with 5GB text files. I only recently started using `subprocess`. – O.rka Jan 31 '18 at 21:12
  • Could well be -- `grep` is heavily optimized. That said, if you're going to be doing this repeatedly, getting your data into an indexed format so this stops being an O(n) operation is in your interest. (If the lines are sorted, for instance, then it's a bisect operation to find either the start or the end of the section that starts with `>`; that gives you the number of *bytes* between those two regions in amortized constant time). – Charles Duffy Jan 31 '18 at 21:13
  • `x = int(subprocess.run(['grep', '-c', '^>', "--", path], stdout=subprocess.PIPE).stdout.strip())` – O.rka Jan 31 '18 at 21:13
  • ...once you know the byte offsets, whether it requires an O(n) scan through that section to count number of *lines* is a question of whether the line length is constant, or if there's a line number in the data format -- if the latter, then it could just be a matter of subtraction, and there you are. – Charles Duffy Jan 31 '18 at 21:15
  • (point I'm making being that slow operations can generally be avoided if your data structures are well-considered). – Charles Duffy Jan 31 '18 at 21:16
  • @O.rka: `grep` takes twice as long for me for a 2.5GB file, regardless of the number of lines starting with `>`. You may want to test it out before committing to a particular solution due to performance. You can shrink the performance gap by using `LC_ALL=C grep ...`. – Blender Jan 31 '18 at 22:34
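
For reference, here is a minimal sketch that pulls together the suggestions from the comments above. `subprocess.call` returns the exit status of the command (hence the `0`), not its output, so the count has to be captured from stdout instead. The function names `count_with_grep` and `count_with_python` are made up for illustration, and treating exit status 1 as "no matches" assumes standard grep behaviour. Which variant is faster seems to depend on the machine and the file, so it is worth timing both.

```python
import os
import subprocess


def count_with_grep(path):
    """Count lines starting with '>' by running grep without a shell."""
    # LC_ALL=C is the locale tweak suggested in the comments; it is optional.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        # List form: the path is passed as a single argument and never
        # interpreted by a shell. "--" stops option parsing, so a path
        # beginning with "-" is still treated as a file name.
        ["grep", "-c", "^>", "--", path],
        stdout=subprocess.PIPE,
        env=env,
    )
    # grep exits with 1 when nothing matched (it still prints 0) and with
    # a higher status on real errors such as a missing file.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"grep failed with exit status {result.returncode}")
    return int(result.stdout.strip())


def count_with_python(path):
    """Count lines starting with '>' in pure Python, reading bytes."""
    with open(path, "rb") as handle:
        # In binary mode each line is a bytes object, so the prefix
        # must be bytes as well.
        return sum(line.startswith(b">") for line in handle)
```

Either function returns a plain `int`, for example `x = count_with_grep(path)`.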

0 Answers