
I have a 120k-line file. Each line has to be processed by an external application. Currently I start a new subprocess for every line and send the line to its stdin. The application takes at least a second to start, and that is a real bottleneck.

I am looking for a way to start the process once and send data to it line by line.

My current code:

    # Not pictured: the loop that iterates over all lines. Here `text` is the line I need to pass to the application.
    pdebug("Sending to tomita:\n----\n", text, "\n----")
    try:
        p = Popen(['tomita/tomitaparser.exe', "tomita/config.proto"], stdout=PIPE, stdin=PIPE, stderr=PIPE)
        stdout_data, stderr_data = p.communicate(input=bytes(text, 'UTF-8'), timeout=45)
        pdebug("Tomita returned stderr:\n", "stderr: "+stderr_data.decode("utf-8").strip()+"\n" )
    except TimeoutExpired:
        p.kill()
        pdebug("Tomita killed")
    stdout_data = stdout_data.decode("utf-8")
    facts = parse_tomita_output(stdout_data)
    pdebug('Received facts:\n----\n',str(facts),"\n----")

The code I tried recently:

    try:
        p = Popen(['tomita/tomitaparser.exe', "tomita/config.proto"], stdout=PIPE, stdin=PIPE, stderr=PIPE)

        for news_line in news:
            pdebug("Sending to tomita:\n----\n", news_line.text,"\n----")
            stdout_data, stderr_data = p.communicate(input=bytes(news_line.text, 'UTF-8'), timeout=45)
            pdebug("Tomita returned stderr:\n",stderr_data.decode("utf-8").strip()+"\n" )
            stdout_data = stdout_data.decode("utf-8")
            facts = parse_tomita_output(stdout_data)
            pdebug('Received facts:\n----\n',str(facts),"\n----")

            news_line.grammemes = facts

    except TimeoutExpired:
        p.kill()
        pdebug("Tomita killed due to timeout")

The recent code produces this error:

ValueError: Cannot send input after starting communication

So is there a way to launch the exe once and then repeatedly send input, read stdout, and flush stdin and stdout?

Euphe
  • [pexpect](http://stackoverflow.com/a/28690745/477878) perhaps? – Joachim Isaksson Feb 20 '16 at 12:23
  • @JoachimIsaksson thanks. My problem is that the program doesn't have a clear "I have done parsing this code" marker in stdout. It sends one to stderr though. – Euphe Feb 20 '16 at 12:38
  • You _might_ be able to do this with `subprocess.Popen`, but that depends on the buffering that your tomitaparser.exe does. You can't do it with `Popen.communicate`; that's for one-shot interaction with a process. You can get handles to your pipes with `p.stdin`, `p.stdout`, `p.stderr`, assuming each of the `stdin`, `stdout`, and `stderr` args to `Popen` is set to `subprocess.PIPE`. – PM 2Ring Feb 20 '16 at 12:41
  • Note that `pexpect` has reduced functionality on Windows: https://pexpect.readthedocs.org/en/stable/overview.html#windows – PM 2Ring Feb 20 '16 at 12:45
  • Possible duplicate of [Multiple inputs and outputs in python subprocess communicate](https://stackoverflow.com/questions/28616018/multiple-inputs-and-outputs-in-python-subprocess-communicate) – Ciro Santilli OurBigBook.com Sep 05 '18 at 08:27

1 Answer

I have a 120k lines file. ... I am looking for a way to make it so I can start the process once and send data to it line by line.

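    import subprocess

    # `filename` is the path to the 120k-line file; `external_app` is the command list,
    # e.g. ['tomita/tomitaparser.exe', 'tomita/config.proto'].
    with open(filename, 'rb', 0) as input_file:
        # Start the process once and let it read the whole file from its stdin.
        subprocess.check_call(external_app, stdin=input_file)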
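If you also need the output in Python rather than on the console, one possible variant is to keep the file as stdin and read the child's stdout through a pipe. This is only a sketch: it assumes the application reads stdin until EOF and writes its results to stdout. The `external_app` command below is a stand-in child (a Python one-liner that upper-cases its input), not tomitaparser, and `process_file` and the demo file are hypothetical names made up for the example.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the real external application (an assumption for this sketch):
# a child Python process that upper-cases every line it reads from stdin.
external_app = [sys.executable, "-c",
                "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]

def process_file(filename):
    """Start the app once, feed it the whole file, and collect its output lines."""
    results = []
    with open(filename, "rb", 0) as input_file:
        with subprocess.Popen(external_app, stdin=input_file,
                              stdout=subprocess.PIPE) as p:
            for raw in p.stdout:  # read output lines as the child produces them
                results.append(raw.decode("utf-8").rstrip("\r\n"))
    # Leaving the inner `with` waits for the child, so returncode is set here.
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, external_app)
    return results

# Demo with a small temporary file standing in for the 120k-line file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("first line\nsecond line\n")
    demo_path = f.name

demo_output = process_file(demo_path)
os.remove(demo_path)
```

Note that stderr is left alone here; if you redirect it to a pipe as well, something has to read it, or the child can block once that pipe fills up.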

To answer the question in the title, see the links in the description of the subprocess tag, under the section "Interacting with a subprocess while it is still running"; they include code examples using pexpect and subprocess.
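For illustration, a request/response loop over the `p.stdin` and `p.stdout` pipes might look roughly like the sketch below. It relies on a strong assumption that may not hold for tomitaparser: the child writes exactly one line of output per line of input and flushes it immediately. If the child buffers its stdout when it is not attached to a terminal, `readline()` will block. The `child_cmd` stand-in (a Python script that reverses each line) and the `ask` helper are hypothetical names for this example only.

```python
import subprocess
import sys

# Hypothetical stand-in child: replies with exactly one flushed line per input line.
child_cmd = [sys.executable, "-u", "-c",
             "import sys\n"
             "for line in sys.stdin:\n"
             "    print(line.strip()[::-1], flush=True)"]

def ask(proc, text):
    """Send one line to the child and block until it answers with one line."""
    proc.stdin.write(text + "\n")
    proc.stdin.flush()  # push the line through the pipe right away
    return proc.stdout.readline().rstrip("\n")

# The process is started once and reused for every line.
with subprocess.Popen(child_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                      text=True, encoding="utf-8", bufsize=1) as proc:
    replies = [ask(proc, line) for line in ["abc", "hello"]]
    proc.stdin.close()  # EOF tells the child there is no more input
```

stderr is deliberately not piped in this sketch. If the only "done" marker arrives on stderr, as mentioned in the comments, you would have to read that pipe too (for example from a separate thread) so that neither pipe fills up and stalls the child.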

jfs