In C# (.NET 4.0 running under Mono 2.8 on SuSE) I would like to run an external batch command and capture its output in binary form. The external tool I use is called 'samtools' (samtools.sourceforge.net), and among other things it can return records from an indexed binary file format called BAM.

I use Process.Start to run the external command, and I know that I can capture its output by redirecting Process.StandardOutput. The problem is, that's a text stream with an encoding, so it doesn't give me access to the raw bytes of the output. The almost-working solution I found is to access the underlying stream.

Here's my code:

        Process cmdProcess = new Process();
        ProcessStartInfo cmdStartInfo = new ProcessStartInfo();
        cmdStartInfo.FileName = "samtools";

        cmdStartInfo.RedirectStandardError = true;
        cmdStartInfo.RedirectStandardOutput = true;
        cmdStartInfo.RedirectStandardInput = false;
        cmdStartInfo.UseShellExecute = false;
        cmdStartInfo.CreateNoWindow = true;

        cmdStartInfo.Arguments = "view -u " + BamFileName + " " + chromosome + ":" + start + "-" + end;

        cmdProcess.EnableRaisingEvents = true;
        cmdProcess.StartInfo = cmdStartInfo;
        cmdProcess.Start();

        // Prepare to read each alignment (binary)
        var br = new BinaryReader(cmdProcess.StandardOutput.BaseStream);

        while (!cmdProcess.StandardOutput.EndOfStream)
        {
            // Consume the initial, undocumented BAM data 
            br.ReadBytes(23);

// ... more parsing follows

But when I run this, the first 23 bytes that I read are not the first 23 bytes of the output, but come from somewhere several hundred or several thousand bytes downstream. I assume that StreamReader does some buffering, so the underlying stream has already advanced, say, 4K into the output. The underlying stream does not support seeking back to the start.

And I'm stuck here. Does anyone have a working solution for running an external command and capturing its stdout in binary form? The output may be very large, so I would like to stream it.

Any help appreciated.

By the way, my current workaround is to have samtools return the records in text format, then parse those, but this is pretty slow and I'm hoping to speed things up by using the binary format directly.

S.Serpooshan
Sten Linnarsson
  • The only thing I can think of offhand would be to set the desired encoding to Unicode and then pick apart each char from the StreamReader into two bytes. Which would be a horrible hack, and would probably fail miserably if the output had an odd number of bytes. A workaround would be to implement your own encoding that maps bytes directly to their respective char values, like ASCII but without converting the upper set into '?'. But I'll let someone else come up with a proper answer. :) – cdhowie Nov 10 '10 at 18:17
  • 10 years after the fact, but could it be that your data is being consumed and processed when the Process instance is getting it ready for the OutputDataReceived event? I was looking for it myself, but I think the Process class can't be used to capture binary data, for this reason. – Luc VdV Jul 09 '21 at 13:25

4 Answers

Using StandardOutput.BaseStream is the correct approach, but you must not use any other property or method of cmdProcess.StandardOutput. For example, accessing cmdProcess.StandardOutput.EndOfStream will cause the StreamReader for StandardOutput to read part of the stream, removing the data you want to access.

Instead, simply read and parse the data from br (assuming you know how to parse the data, and won't read past the end of stream, or are willing to catch an EndOfStreamException). Alternatively, if you don't know how big the data is, use Stream.CopyTo to copy the entire standard output stream to a new file or memory stream.
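
As a minimal sketch of that advice (not the asker's actual parser), the helper below touches only the raw stream and reports a clean end of output without ever calling EndOfStream. The names RawStdout, Start and TryReadExactly are hypothetical, chosen for illustration; only Process, ProcessStartInfo and Stream.Read come from the framework.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper class; the names are illustrative, not part of any library.
public static class RawStdout
{
    // Starts a command with stdout redirected so its raw bytes can be read.
    public static Process Start(string fileName, string arguments)
    {
        var info = new ProcessStartInfo(fileName, arguments);
        info.RedirectStandardOutput = true;
        info.UseShellExecute = false;
        info.CreateNoWindow = true;
        return Process.Start(info);
    }

    // Fills the whole buffer from the raw stream.
    // Returns false if the stream had already ended before any byte was read;
    // throws EndOfStreamException if it ends partway through the buffer.
    public static bool TryReadExactly(Stream raw, byte[] buffer)
    {
        int offset = 0;
        while (offset < buffer.Length)
        {
            // A pipe may deliver fewer bytes than requested, so keep reading.
            int n = raw.Read(buffer, offset, buffer.Length - offset);
            if (n == 0)
            {
                if (offset == 0) return false;
                throw new EndOfStreamException("stream ended inside a record");
            }
            offset += n;
        }
        return true;
    }
}
```

Under these assumptions, the question's loop would obtain the stream once with RawStdout.Start("samtools", arguments).StandardOutput.BaseStream and then call TryReadExactly until it returns false, using no other member of StandardOutput. If the output only needs to be saved rather than parsed, calling CopyTo on that same stream with a FileStream as the destination streams it through without holding it all in memory.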

Bradley Grainger
  • And where should Stream.CopyTo be called to handle the whole output, which may be extremely large? – SerG Feb 26 '14 at 13:17

Since you explicitly specified that you are running on SuSE Linux and Mono, you can work around the problem by using native Unix calls to create the redirection and read from the stream. For example:

using System;
using System.Diagnostics;
using System.IO;
using Mono.Unix;

class Test
{
    public static void Main()
    {
        int reading, writing;
        Mono.Unix.Native.Syscall.pipe(out reading, out writing);
        int stdout = Mono.Unix.Native.Syscall.dup(1);
        Mono.Unix.Native.Syscall.dup2(writing, 1);
        Mono.Unix.Native.Syscall.close(writing);

        Process cmdProcess = new Process();
        ProcessStartInfo cmdStartInfo = new ProcessStartInfo();
        cmdStartInfo.FileName = "cat";
        cmdStartInfo.CreateNoWindow = true;
        cmdStartInfo.Arguments = "test.exe";
        cmdProcess.StartInfo = cmdStartInfo;
        cmdProcess.Start();

        Mono.Unix.Native.Syscall.dup2(stdout, 1);
        Mono.Unix.Native.Syscall.close(stdout);

        Stream s = new UnixStream(reading);
        byte[] buf = new byte[1024];
        int bytes = 0;
        int current;
        while((current = s.Read(buf, 0, buf.Length)) > 0)
        {
            bytes += current;
        }
        Mono.Unix.Native.Syscall.close(reading);
        Console.WriteLine("{0} bytes read", bytes);
    }
}

Under Unix, file descriptors are inherited by child processes unless marked otherwise (close-on-exec). So, to redirect the stdout of a child, all you need to do is change file descriptor #1 in the parent process before calling exec. Unix also provides a handy thing called a pipe, which is a unidirectional communication channel with two file descriptors representing its two endpoints. To duplicate file descriptors you can use dup or dup2. Both create an equivalent copy of a descriptor, but dup returns a new descriptor allocated by the system, while dup2 places the copy in a specific target (closing the target first if necessary). What the above code does, then:

  1. Creates a pipe with endpoints reading and writing
  2. Saves a copy of the current stdout descriptor
  3. Assigns the pipe's write endpoint to stdout and closes the original write descriptor
  4. Starts the child process so it inherits stdout connected to the write endpoint of the pipe
  5. Restores the saved stdout
  6. Reads from the reading endpoint of the pipe by wrapping it in a UnixStream

Note, in native code, a process is usually started by a fork+exec pair, so the file descriptors can be modified in the child process itself, but before the new program is loaded. This managed version is not thread-safe as it has to temporarily modify the stdout of the parent process.

Since the code starts the child process without managed redirection, the .NET runtime does not change any descriptors or create any streams. So the only reader of the child's output is the user code, which uses a UnixStream and therefore avoids the StreamReader's encoding and buffering issues.

Jester
  • Can you comment on (1) how the pipe gets attached to the new process' stdout, and (2) how this works around the issue where the StreamReader buffers some bytes on its creation? – cdhowie Dec 23 '10 at 02:56

I checked what's happening with Reflector. It seems to me that StreamReader doesn't read until you call a read method on it, but it's created with a buffer size of 0x1000, so maybe it does. Luckily, until you actually read from it, you can safely get the buffered data out of it: it has a private field byte[] byteBuffer and two integer fields, byteLen and bytePos. The first is the number of bytes in the buffer; the second is how many of them have been consumed, which should be zero. So first read this buffer with reflection, then create the BinaryReader.

fejesjoco
  • Oh now I see, you call EndOfStream, that really causes a buffered read. So like Bradley suggested, don't do that, and you'll be fine without messing with private fields. – fejesjoco Dec 27 '10 at 07:54

Maybe you can try something like this:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using UnityEngine;                       // assumed: the Debug.Log* calls look like UnityEngine.Debug
using Diagnostics = System.Diagnostics;  // alias so that Diagnostics.Process resolves

public class ThirdExe
{
    private static ThirdExe _instance = null;
    private Diagnostics.Process _process = null;
    private Stream _messageStream;
    private byte[] _recvBuff = new byte[65536];
    private int _recvBuffLen; // number of unparsed bytes at the start of _recvBuff
    private Queue<TonguePb.Msg> _msgQueue = new Queue<TonguePb.Msg>();

    // ErrorReceived and OnProcessExit are event handlers that are not shown here.
    void StartProcess()
    {
        try
        {
            _process = new Diagnostics.Process();
            _process.EnableRaisingEvents = false;
            _process.StartInfo.FileName = "d:/code/boot/tongueerl_d.exe"; // Your exe
            _process.StartInfo.UseShellExecute = false;
            _process.StartInfo.CreateNoWindow = true;
            _process.StartInfo.RedirectStandardOutput = true;
            _process.StartInfo.RedirectStandardInput = true;
            _process.StartInfo.RedirectStandardError = true;
            _process.ErrorDataReceived += new Diagnostics.DataReceivedEventHandler(ErrorReceived);
            _process.Exited += new EventHandler(OnProcessExit);
            _process.Start();
            _messageStream = _process.StandardInput.BaseStream; // raw stdin of the child
            _process.BeginErrorReadLine();
            AsyncRead();
        }
        catch (Exception e)
        {
            Debug.LogError("Unable to launch app: " + e.Message);
        }
    }

    private void AsyncRead()
    {
        // Append after any bytes left over from an incomplete message.
        // Note: a message larger than the buffer would leave no room to read into.
        _process.StandardOutput.BaseStream.BeginRead(_recvBuff, _recvBuffLen,
                _recvBuff.Length - _recvBuffLen, new AsyncCallback(DataReceived), null);
    }

    void DataReceived(IAsyncResult asyncResult)
    {
        int nread = _process.StandardOutput.BaseStream.EndRead(asyncResult);
        if (nread == 0)
        {
            Debug.Log("process read finished"); // process exit
            return;
        }
        _recvBuffLen += nread;
        Debug.LogFormat("recv data size.{0}  remain.{1}", nread, _recvBuffLen);
        ParseMsg();
        AsyncRead();
    }

    void ParseMsg()
    {
        int pos = 0;
        // Each message is a 4-byte length in network byte order, then the payload.
        while (_recvBuffLen - pos >= 4)
        {
            int len = IPAddress.NetworkToHostOrder(BitConverter.ToInt32(_recvBuff, pos));
            if (len > _recvBuffLen - pos - 4)
            {
                Debug.LogFormat("current call can't parse the NetMsg for data incomplete");
                break;
            }
            TonguePb.Msg msg = TonguePb.Msg.Parser.ParseFrom(_recvBuff, pos + 4, len);
            Debug.LogFormat("recv msg count.{1}:\n {0} ", msg.ToString(), _msgQueue.Count + 1);
            _msgQueue.Enqueue(msg);
            pos += len + 4;
        }
        // Move the unparsed tail to the front so the next read appends to it.
        _recvBuffLen -= pos;
        if (pos > 0 && _recvBuffLen > 0)
        {
            Array.Copy(_recvBuff, pos, _recvBuff, 0, _recvBuffLen);
        }
    }
}

The key is to read with _process.StandardOutput.BaseStream.BeginRead(...), passing an AsyncCallback (DataReceived here). The very important point is that the reads become asynchronous and callback-driven, much like the Process.OutputDataReceived event, but delivering raw bytes instead of decoded text.
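
The buffering side of this approach can be looked at separately from the process and protobuf details. The sketch below is an illustration under assumptions, not the answer's code: FrameBuffer, Append and TryTake are hypothetical names, and the frame layout (a 4-byte big-endian length followed by the payload) is inferred from the NetworkToHostOrder call in ParseMsg.

```csharp
using System;

// Hypothetical helper: accumulates raw bytes and extracts frames laid out as
// [4-byte big-endian length][payload].
public sealed class FrameBuffer
{
    private byte[] _data = new byte[65536];
    private int _length; // number of valid bytes at the start of _data

    // Appends the first count bytes of chunk, growing the buffer if needed.
    public void Append(byte[] chunk, int count)
    {
        if (_length + count > _data.Length)
            Array.Resize(ref _data, Math.Max(_data.Length * 2, _length + count));
        Array.Copy(chunk, 0, _data, _length, count);
        _length += count;
    }

    // Returns true and the next payload once a complete frame is buffered.
    public bool TryTake(out byte[] payload)
    {
        payload = null;
        if (_length < 4) return false;
        int len = (_data[0] << 24) | (_data[1] << 16) | (_data[2] << 8) | _data[3];
        if (len < 0) throw new InvalidOperationException("negative frame length");
        if (len > _length - 4) return false; // frame not complete yet

        payload = new byte[len];
        Array.Copy(_data, 4, payload, 0, len);

        // Shift the unconsumed tail to the front of the buffer.
        _length -= 4 + len;
        if (_length > 0)
            Array.Copy(_data, 4 + len, _data, 0, _length);
        return true;
    }
}
```

In a read callback, one would call Append with the bytes just received and then loop on TryTake until it returns false, so that a single read containing several frames, or only part of one, is handled the same way.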

IclodQ