Subprocess very slow when calling external egrep and less

Question

I'm trying to build a Python script that dynamically builds up the patterns passed to egrep -v and pipes the output into less (or more).
I want to use external egrep and less because the files I am processing are very large text files (500MB+). Reading them into a list first and processing everything natively in Python is very slow.

However, when I use os.system or subprocess.call, everything becomes very slow at the moment I exit less and control should return to my Python code.

My code should work like this:
1. ./myless.py messages_500MB.txt
2. The less -FRX output of messages_500MB.txt is shown (complete file).
3. When I press 'q' to exit less -FRX, the Python code should take over and display a prompt for the user to enter text to be excluded. The user enters it and I add it to the list
4. My Python code builds up egrep -v 'exclude1' and pipes the output to less
5. The user repeats step 3 and enters more text to be excluded
6. Now my python code calls egrep -v 'exclude1|exclude2' messages_500MB.txt | less -FRX
7. And the process continues
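
Steps 4 and 6 above amount to joining the excluded texts with '|' into one extended regular expression. A minimal sketch of that step, assuming the command is built as an argument list rather than a shell string (the helper name `build_egrep_args` is hypothetical and not part of the original script):

```python
def build_egrep_args(excluded, filename):
    """Return an argv list for egrep that hides lines matching any excluded text."""
    if not excluded:
        # Nothing to exclude yet: the empty pattern matches every line.
        return ['egrep', '-e', '', '--', filename]
    # Alternation: a line is dropped if it matches any of the excluded texts.
    # The texts are used as extended regular expressions, as in the steps above.
    pattern = '|'.join(excluded)
    # -e keeps a pattern starting with '-' from being read as an option;
    # '--' does the same for the file name.
    return ['egrep', '-v', '-e', pattern, '--', filename]
```

Passing a list like this to subprocess avoids the quoting problems that appear when the user's text contains a double quote or other shell metacharacters.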

However, this does not work as expected.
* On my Mac, when the user presses q to exit less -FRX, it takes a few seconds for the raw_input prompt to be displayed
* On a Linux machine, I get loads of 'egrep: writing output: Broken pipe' messages
* If (Linux only) I press CTRL+C while in less -FRX, exiting less -FRX for some reason becomes much quicker (as intended). On the Mac, pressing CTRL+C breaks my Python program

Here is sample of my code:

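import subprocess
import sys

filename = sys.argv[1]  # e.g. ./myless.py messages_500MB.txt
excluded = []
myInput = ''
while myInput != 'q':
    grepText = '|'.join(excluded)
    if grepText == '':
        # Nothing excluded yet: the empty pattern matches every line.
        command = 'egrep "" ' + filename + ' | less -FRX'
    else:
        command = 'egrep -v "' + grepText + '" ' + filename + ' | less -FRX'

    subprocess.call(command, shell=True)
    myInput = raw_input('Enter text to exclude, q to exit, # to see what is excluded: ')
    if myInput == '#':
        print('Excluded: ' + ', '.join(excluded))
    elif myInput not in ('', 'q'):
        # Do not add the quit command or an empty string as a pattern.
        excluded.append(myInput)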

Any help would be much appreciated

score 2 · Answer 1 · answered May 06 '15 at 08:40

Actually, I figured out what the problem is.

I did some research on the error that shows up when running my script on Linux ("egrep: writing output: Broken pipe"), and that led me to the answer:
when I run egrep -v 'xyz' file | less and then quit less, subprocess still continues to run egrep, and on large files (500MB+) this takes a while.

Apparently, the two programs in the pipeline run as separate processes, and the first one (egrep) keeps running even after the second one (less) has exited.
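
One possible explanation, offered here as an assumption rather than something confirmed in this thread: Python sets SIGPIPE to be ignored at startup, and in Python 2 processes started through subprocess inherit that setting. egrep would then get a write error after less exits instead of being stopped by the signal. A minimal sketch of restoring the default behaviour in the child (the names `restore_sigpipe` and `call_pipeline` are hypothetical):

```python
import signal
import subprocess


def restore_sigpipe():
    # Runs in the child just before exec: give SIGPIPE its default action back,
    # so a writer is stopped as soon as the reader of its pipe has gone away.
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)


def call_pipeline(command):
    """Run a shell pipeline with default SIGPIPE handling; return its exit code."""
    return subprocess.call(command, shell=True, preexec_fn=restore_sigpipe)
```

In Python 3.2 and later, subprocess restores these signals by default through its `restore_signals=True` parameter, so the `preexec_fn` is mainly relevant to Python 2 scripts like the one in the question.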

To properly resolve my issue, I use something like this:

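import subprocess

command = 'egrep -v "something" <filename>'
cmd2 = ('less', '-FRX')
# Start egrep with its stdout connected to a pipe instead of the terminal.
egrep = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
# Run less on that pipe and block until the user quits it.
subprocess.check_call(cmd2, stdin=egrep.stdout)
# less has exited, so stop the process started for egrep instead of waiting for it.
egrep.terminate()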

By piping the first process's stdout into the second process's stdin, I can terminate egrep as soon as I exit less, and now my Python script is flying :)

Cheers,
Milos

note: `.terminate()` terminates the shell process (`/bin/sh`), the grandchild processes might continue to run. See [How to terminate a python subprocess launched with shell=True](http://stackoverflow.com/q/4789837/4279). Or better yet, run `grep` directly without the shell. You should also close the pipes, to avoid fd leaks. See [How do I use subprocess.Popen to connect multiple processes by pipes?](http://stackoverflow.com/q/295459/4279) — jfs, May 09 '15 at 16:59
Thank you Sebastian. This is definitely something that I need to look into in order to improve script I've been making for my work. Cheers for that — Milos Kostic-Veljkovic, May 15 '15 at 03:21
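
The suggestions in the comment above (run the command directly without the shell, and close the pipe) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the answer: the function name `run_filtered` is hypothetical, and both commands are passed as argument lists.

```python
import subprocess


def run_filtered(producer_cmd, consumer_cmd):
    """Run producer_cmd | consumer_cmd without a shell; return the consumer's exit code."""
    # No shell=True: the Popen object refers to the producer itself,
    # so terminate() reaches it directly rather than a /bin/sh wrapper.
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    try:
        # Blocks until the consumer (for example less) exits.
        returncode = subprocess.call(consumer_cmd, stdin=producer.stdout)
    finally:
        # Close our copy of the pipe's read end to avoid leaking the descriptor.
        producer.stdout.close()
        # Stop the producer if it is still working through the file.
        if producer.poll() is None:
            producer.terminate()
        # Reap the child so it does not linger as a zombie.
        producer.wait()
    return returncode
```

In the script from the question, a call might look like `run_filtered(['egrep', '-v', 'exclude1|exclude2', 'messages_500MB.txt'], ['less', '-FRX'])`.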
