3

I want to call an external process from Python. The process I'm calling reads an input string, gives a tokenized result, and then waits for another input (the binary is the MeCab tokenizer, if that helps).

I need to tokenize thousands of lines of text by calling this process.

The problem is that Popen.communicate() works but waits for the process to die before returning the STDOUT result. I don't want to keep closing and opening new subprocesses thousands of times. (And I don't want to send the whole text at once; it may easily grow to tens of thousands of long lines in the future.)

from subprocess import PIPE, Popen

with Popen("mecab -O wakati".split(), stdin=PIPE,
           stdout=PIPE, stderr=PIPE, close_fds=False,
           universal_newlines=True, bufsize=1) as proc:
    output, errors = proc.communicate("foobarbaz")

print(output)

I've tried reading proc.stdout.read() instead of using communicate, but it blocks and doesn't return any results until proc.stdin.close() is called, which, again, means I'd need to create a new process every time.

I've tried implementing queues and threads from a similar question, as below, but it either doesn't return anything (so it's stuck in the while True loop), or, when I force the stdin buffer to fill by repeatedly sending strings, it outputs all the results at once.

from subprocess import PIPE, Popen
from threading import Thread
from queue import Queue, Empty

def enqueue_output(out, queue):
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

p = Popen('mecab -O wakati'.split(), stdout=PIPE, stdin=PIPE,
          universal_newlines=True, bufsize=1, close_fds=False)
q = Queue()
t = Thread(target=enqueue_output, args=(p.stdout, q))
t.daemon = True
t.start()

p.stdin.write("foobarbaz")
while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break

I also looked at the Pexpect route, but its Windows port doesn't support some important modules (the pty-based ones), so I couldn't apply that either.

I know there are a lot of similar answers, and I've tried most of them. But nothing I've tried seems to work on Windows.

EDIT: some info on how the binary behaves when I use it via the command line: it runs and tokenizes the sentences I give it until I'm done and forcibly close the program.

(...waits_for_input -> input_received -> output -> waits_for_input...)

Thanks.

umutto
  • Since you're just running MeCab in `wakati` mode, can you not just pipe all the lines of your input (newlines and all) into the process' stdin? – Ahmed Fasih Mar 24 '17 at 04:15
  • @AhmedFasih I can, but the input is the comments, posts, etc. in a user database, so all the inputs together form a very large file that can grow exponentially to the point it could be larger than memory soon. I would prefer to do it sequentially if I can, as it also benefits my general code logic (tokenize per user -> process user -> etc...). – umutto Mar 24 '17 at 04:19
  • If mecab uses C `FILE` streams with default buffering, then piped `stdout` has a 4 KiB buffer. Have you tried writing input repeatedly until mecab fills and flushes its `stdout` buffer? Does mecab have a command-line option to force using no buffering or line buffering instead of full buffering? – Eryk Sun Mar 24 '17 at 05:23
  • @eryksun Checking the documentation, it has an input buffer size setting (8 KB) but no output buffer size. I've tried padding my stdin.write call with 8 KB of empty space, which worked (yay) but seems hackish. Can I force it to flush its buffer some other way? When I use it on the command line it tokenizes my inputs correctly without closing the process. – umutto Mar 24 '17 at 06:00
  • There's no generic way on Windows to modify the output buffer size used by `FILE` streams. The C runtime situation is too complicated. A process can link statically or dynamically to one or more CRTs. The situation on Linux is different, so there are commands like `stdbuf` that can attempt to modify the buffering of standard `FILE` streams. – Eryk Sun Mar 24 '17 at 06:22
  • @eryksun Thanks for the replies, I'll flush it this way for now then. Can you post your comments as an answer so I can accept it? – umutto Mar 24 '17 at 06:38
  • FWIW, the Tao of Windows says that the correct solution is to rebuild the external process as a DLL. Of course, that isn't always practical. – Harry Johnston Mar 24 '17 at 16:44
  • @HarryJohnston Thanks! That actually looks promising. I've built a DLL and tried to import it using ctypes but failed on the return types because I'm not very familiar with C. I'll work on it a bit more. – umutto Mar 27 '17 at 03:36

4 Answers

3

If mecab uses C FILE streams with default buffering, then piped stdout has a 4 KiB buffer. The idea here is that a program can efficiently use small, arbitrary-sized reads and writes to the buffers, and the underlying standard I/O implementation handles automatically filling and flushing the much-larger buffers. This minimizes the number of required system calls and maximizes throughput. Obviously you don't want this behavior for interactive console or terminal I/O or writing to stderr. In these cases the C runtime uses line-buffering or no buffering.

A program can override this behavior, and some do have command-line options to set the buffer size. For example, Python has the "-u" (unbuffered) option and PYTHONUNBUFFERED environment variable. If mecab doesn't have a similar option, then there isn't a generic workaround on Windows. The C runtime situation is too complicated. A Windows process can link statically or dynamically to one or several CRTs. The situation on Linux is different since a Linux process generally loads a single system CRT (e.g. GNU libc.so.6) into the global symbol table, which allows an LD_PRELOAD library to configure the C FILE streams. Linux stdbuf uses this trick, e.g. stdbuf -o0 mecab -O wakati.
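
For example, on Linux you could wrap the child in stdbuf from Python like this (a minimal sketch, assuming mecab is installed there; the point above is that no such wrapper exists on Windows):

from subprocess import PIPE, Popen

# stdbuf -o0 turns off mecab's stdout buffering, so each tokenized line
# comes back as soon as the child writes it.
with Popen(["stdbuf", "-o0", "mecab", "-O", "wakati"],
           stdin=PIPE, stdout=PIPE, universal_newlines=True, bufsize=1) as proc:
    proc.stdin.write("foobarbaz\n")
    proc.stdin.flush()
    print(proc.stdout.readline(), end="")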


One option to experiment with is to call CreateConsoleScreenBuffer and get a file descriptor for the handle from msvcrt.open_osfhandle. Then pass this as stdout instead of using a pipe. The child process will see this as a TTY and use line buffering instead of full buffering. However managing this is non-trivial. It would involve reading (i.e. ReadConsoleOutputCharacter) a sliding buffer (call GetConsoleScreenBufferInfo to track the cursor position) that's actively written to by another process. This kind of interaction isn't something that I've ever needed or even experimented with. But I have used a console screen buffer non-interactively, i.e. reading the buffer after the child has exited. This allows reading up to 9,999 lines of output from programs that write directly to the console instead of stdout, e.g. programs that call WriteConsole or open "CON" or "CONOUT$".
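
For anyone who wants to experiment with this, here is a rough, untested sketch of just the setup half (the constants are the usual Win32 values; tracking the cursor with GetConsoleScreenBufferInfo and scraping the text out with ReadConsoleOutputCharacter, which is the hard part, is left out):

import ctypes
import msvcrt
import os
from ctypes import wintypes
from subprocess import PIPE, Popen

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateConsoleScreenBuffer.restype = wintypes.HANDLE

GENERIC_READ = 0x80000000
GENERIC_WRITE = 0x40000000
FILE_SHARE_READ = 0x00000001
FILE_SHARE_WRITE = 0x00000002
CONSOLE_TEXTMODE_BUFFER = 1

# Create a console screen buffer and wrap the handle in a CRT file
# descriptor so Popen can pass it to the child as stdout.
handle = kernel32.CreateConsoleScreenBuffer(
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    None, CONSOLE_TEXTMODE_BUFFER, None)
fd = msvcrt.open_osfhandle(handle, os.O_TEXT)

# The child sees a console (TTY) on stdout, so its CRT should line-buffer.
proc = Popen("mecab -O wakati".split(), stdin=PIPE, stdout=fd,
             universal_newlines=True)
# Reading the results back out of the screen buffer is the non-trivial part
# described above and is not shown here.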

Eryk Sun
0

Here is a workaround for Windows that should also be adaptable to other operating systems. Download a console emulator like ConEmu (https://conemu.github.io/) and start it instead of mecab as your subprocess.

p = Popen(['conemu'], stdout=PIPE, stdin=PIPE,
          universal_newlines=True, bufsize=1, close_fds=False)

Then send the following as the first input:

mecab -O wakati & exit

You are letting the emulator handle the output issues for you, the way it normally does when you interact with it manually. I am still looking into this, but it already looks promising...

The only problem is that ConEmu is a GUI application, so if there is no other way to hook into its input and output, one might have to tweak and rebuild it from source (it's open source). I haven't found any other way, but this should work.

I have asked a question about running it in some sort of console mode here, so you can check that thread as well. The author, Maximus, is on SO...

Seyi Shoboyejo
  • Won't make any difference. It is output going to the console that is treated differently; whether or not an instance of the command prompt is present has no effect. Also, what's with the semicolon? – Harry Johnston Jul 20 '17 at 22:13
  • My reasoning is that you should have nothing to do with running mecab directly, but instead run cmd.exe and just send it the command to run mecab (exiting after running mecab). This way it should be like manually starting cmd.exe and entering the command. Or does the output buffer issue cause problems when run like that? – Seyi Shoboyejo Jul 22 '17 at 08:33
  • Then there is the brute-force approach: just start cmd.exe (not as a subprocess), send keystrokes to it, make the command for running mecab redirect the output to a file (command >out.txt), and get your tokenized output from there. Can you not run mecab from the command line at all? – Seyi Shoboyejo Jul 22 '17 at 08:44
  • The problem occurs whenever output is redirected. It doesn't matter whether output is redirected by Python or by the command processor, i.e., when you say `>out.txt` - it's all the same as far as the child program is concerned. If you *don't* redirect output, as is usually the case when the program is run manually, there's no problem - except that in this scenario that makes it difficult for the parent process to see what the output is. Eryksun's answer goes into more detail. – Harry Johnston Jul 22 '17 at 09:24
  • Okay, I get you. But I would have thought that the involved process of using a console screen buffer to communicate with the child would have been handled by an important program like cmd.exe. It would use that to get output from the child and then write to the output file you specified. Surely Microsoft is big enough to write all that code in an hour. No need to redirect output with a pipe here. If that's the implementation of cmd.exe, what about PowerShell? I mean, if a program can print something to screen it can also write it to a file. Is it possible they cannot do that? Why? – Seyi Shoboyejo Jul 22 '17 at 13:23
  • Or does the child process print directly to cmd.exe's screen without any intervention from cmd.exe?? – Seyi Shoboyejo Jul 22 '17 at 13:36
  • If output hasn't been redirected, the child process writes directly to the console window. If output *has* been redirected, the child process writes directly to the file or pipe. The command interpreter (cmd.exe) never reads content from the console screen buffer, and the console screen buffer was never intended to be used for IPC in the first place! (Keep in mind that the problem only occurs when you're trying to run a program intended for interactive use non-interactively, i.e., when you're using it in a way it wasn't designed for. It shouldn't be surprising that this is difficult.) – Harry Johnston Jul 22 '17 at 23:05
  • *I mean if a program can print something to screen it can also write it to file.* - certainly; if the program was *designed* with the expectation that standard output might be redirected to a file or pipe, there's no problem, it just has to turn off buffering [as described here](https://stackoverflow.com/a/7876756/886887). It's only when the child program *wasn't* designed to be used in this way that you run into trouble. – Harry Johnston Jul 22 '17 at 23:13
  • Yes you really showed me something I wasn't clear on there: cmd.exe is just another console app and that black screen doesn't belong to it. All the same, some things should just work well like that and there should have been a wrapper somewhere. It will certainly be less efficient but surely it would still be very beneficial. This kind of code that Eryksun suggested is too low level to require developers trying to solve other problems to grapple with. – Seyi Shoboyejo Jul 23 '17 at 14:04
  • Many makers of console apps may not know to turn off buffering; at least until they find themselves on the other side. This doesn't have to require complex solutions from 'end' developers... – Seyi Shoboyejo Jul 23 '17 at 14:04
  • Only a very small fraction of Windows users ever want to do this so I guess it's a case of [Minus 100 Points](https://blogs.msdn.microsoft.com/ericgu/2004/01/12/minus-100-points/). On Windows, the correct solution (when you expect your program to be used by another program) is to provide it as a DLL or perhaps a COM object. At any rate, given that we've established that your answer is incorrect, [I recommend that you delete it](https://meta.stackexchange.com/q/88346/187745) before it starts attracting downvotes. (No offense intended; that's just how Stack Overflow works.) – Harry Johnston Jul 23 '17 at 21:36
  • Might a console emulator like conemu not do what I was expecting from cmd.exe? I have actually learned a lot from all this. Thanks! – Seyi Shoboyejo Jul 24 '17 at 10:19
0

The code

while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break

is essentially the same as

print(q.get())

except less efficient because it burns CPU time while waiting. The explicit loop won't make data from the subprocess arrive sooner; it arrives when it arrives.

For dealing with uncooperative binaries I have a few suggestions, from best to worst:

  1. Find a Python library and use that instead. It appears that there's an official Python binding in the MeCab source tree and I see some prebuilt packages on PyPI (see the sketch after this list). You can also look for a DLL build that you can call with ctypes or another Python FFI. If that doesn't work...

  2. Find a binary that flushes after each line of output. The most recent Win32 build I found online, v0.98, does flush after each line. Failing that...

  3. Build your own binary that flushes after each line. It should be easy enough to find the main loop and insert a flush call in it. But MeCab seems to explicitly flush already, and git blame says that the flush statement was last changed in 2011, so I'm surprised you ever had this problem and I suspect that there may have just been a bug in your Python code. Failing that...

  4. Process the output asynchronously. If your concern is that you want to deal with the output in parallel with the tokenization for performance reasons, you can mostly do that, after the first 4K. Just do the processing in the second thread instead of stuffing the lines in a queue. If you can't do that...

  5. This is a terrible hack but it may work in some cases: intersperse your inputs with dummy inputs that produce at least 4K of output. For example, you could output 2047 blank lines after every real input line (2047 CRLFs plus the CRLF from the real output = 4K), or a single line of b'A' * 4092 + b'\r\n', whichever is faster.
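
For option 1, here is a minimal sketch assuming the mecab-python3 package from PyPI (pip install mecab-python3); the option string mirrors the -O wakati flag used in the question:

import MeCab  # assumes the mecab-python3 package; other bindings may differ

# The binding calls libmecab in-process, so there is no pipe and no output
# buffering problem: each call returns the tokenized line immediately.
tagger = MeCab.Tagger("-Owakati")

for line in ["first sentence to tokenize", "second sentence to tokenize"]:
    print(tagger.parse(line).strip())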

Not on this list at all is an approach suggested by the two previous answers: directing the output to a Win32 console and scraping the console. This is a terrible idea because scraping gets you cooked output as a rectangular array of characters. The scraper has no way to know whether two lines were originally one overlong line that wrapped. If it guesses wrong, your outputs will get out of sync with your inputs. It's impossible to work around output buffering in this way if you care at all about the integrity of the output.

benrg
0

I guess the answer, if not the solution, can be found here: https://github.com/ikriv/ConsoleProxy/blob/master/src/Tools/Exec/readme.md

I say "guess" because I had a similar problem, which I worked around, and I could not try this route because the tool is not available for Windows 2003, which is the OS I had to use (in a VM for a legacy application).

I'd like to know if I guessed right.