
I'm running a C tool compiled to wasm using emscripten. The tool works on very large files. When running this tool normally on the CLI, operations often stream the results and terminate the program early once enough data has been returned. For example you might run:

./tool <input-file> | head -n 100

The tool would terminate after it detects that stdout has been closed by head, effectively reading only a small portion of the input.

The problem is that stdout with emscripten appears to be asynchronous (it is captured by overriding Module.print), so the tool runs to completion every time. Is there a way to make it block on stdout so I can read only as much as I need and then terminate the tool?

anderspitman
  • *The problem is that stdout with emscripten appears to be asynchronous* The data has to be going somewhere. There could just be a large buffer. What OS are you running on? If you're on Linux, you can try using [the `stdbuf` utility](http://man7.org/linux/man-pages/man1/stdbuf.1.html), as in this answer: https://stackoverflow.com/a/25548995/4756299 – Andrew Henle Feb 13 '20 at 22:07
  • Sorry if I wasn't very clear. It's not a buffering problem. Quite the opposite. I have too much data coming out, and I want to be able to tell the WebAssembly process to block until I've processed the data already received. – anderspitman Feb 13 '20 at 23:27

2 Answers


You can redirect the output to a file and put the task in the background, then monitor the log file; when it reaches 100 lines, kill the child PID.

Something like this should work:

rm -f /tmp/log
touch /tmp/log
./tool input_file > /tmp/log 2>&1 &   # run the tool in the background
pid=$!

while sleep 1
do
    ret=$(wc -l < /tmp/log)           # count the lines written so far
    if [ "$ret" -ge 100 ]
    then
        kill "$pid"
        exit 0
    fi
done

The touch creates an empty log file up front, which avoids a race condition where we read the file before the child process has created it. wc -l returns the number of lines written so far. Change the sleep value to whatever is appropriate for your test time.

jmq
  • You don't even need a tmp file to do that. You can just collect lines from Module.print directly and kill the process when you have enough. The problem is that the tool can waste a lot of cycles in the meantime. – anderspitman Feb 13 '20 at 23:21

You need to implement a way to tell your tool to stop. There are many ways to do this. Two that come to mind:

  1. Have it take an extra argument indicating the number of lines of output after which it should stop, then call it with that argument. This is the simplest approach and the easiest to implement. The main drawback is that you need to know the maximum number of lines ahead of time so you can include it in your call arguments, and the tool must be able to translate that accurately into a stopping point. If that is the case, which it sounds like it is, just do this and you're done (a minimal sketch is shown below).

But suppose your tool does not know how to count lines: perhaps it just outputs blobs, or perhaps a downstream filter only counts some of the lines toward your maximum. In any case where your tool needs some other signal to tell it when to stop, the extra argument will not work, in which case, read on...
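Here is a minimal sketch of option 1, assuming the tool reads its input file line by line; the <input-file> <max-lines> interface and the fgets loop are placeholders, not the asker's actual tool:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <input-file> <max-lines>\n", argv[0]);
        return 1;
    }

    FILE *in = fopen(argv[1], "r");
    if (!in) { perror("fopen"); return 1; }

    long max_lines = strtol(argv[2], NULL, 10);   /* the extra stop argument */
    long emitted = 0;
    char buf[4096];

    /* stop as soon as enough output lines have been produced */
    while (emitted < max_lines && fgets(buf, sizeof buf, in) != NULL) {
        fputs(buf, stdout);   /* stand-in for the tool's real per-line output */
        emitted++;
    }

    fclose(in);
    return 0;
}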

  2. Use a callback. Create and export another function, e.g. tool_stop(). In your Module.print override, at the appropriate time, call tool_stop(). In your C code, create a flag, let's call it stop_processing, that is visible both to your tool command and to the function that processes the input. In your processing loop (e.g. before each fread call), your tool command checks this flag and, if it is set, stops processing. By "visible" I mean you could make it a global variable, if you will never have more than one concurrent invocation running, or make it part of some context data that is allocated via an init call, passed along with every process() or stop() call, and then deallocated via a destroy() call. The latter approach is generally cleaner, more scalable, and more maintainable, though it is a bit more work, since you have to add init and destroy functions and a context pointer to each of your function definitions and calls. A sketch of the global-flag variant follows.
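Here is a minimal sketch of the global-flag variant of option 2, assuming a single concurrent invocation; tool_run() and its file-path argument are placeholders for the tool's real entry point and processing loop:

#include <emscripten.h>
#include <stdio.h>

static volatile int stop_processing = 0;   /* global flag; fine with a single invocation */

/* Exported so the Module.print override can call it (e.g. via Module._tool_stop())
   once enough lines have been collected on the JavaScript side. */
EMSCRIPTEN_KEEPALIVE
void tool_stop(void)
{
    stop_processing = 1;
}

/* Placeholder for the tool's real processing entry point. */
EMSCRIPTEN_KEEPALIVE
int tool_run(const char *path)
{
    FILE *in = fopen(path, "r");
    if (!in) return 1;

    char buf[4096];
    /* check the flag before each read and bail out once tool_stop() has been called */
    while (!stop_processing && fgets(buf, sizeof buf, in) != NULL) {
        fputs(buf, stdout);   /* stand-in for the real processing and output */
    }

    fclose(in);
    return 0;
}

On the JavaScript side, the Module.print override would count the lines it receives and call the exported stop function once it has enough.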
mwag