3

I am trying to integrate use of samtools into a C program. This application reads data in a binary format called BAM, e.g. from stdin:

$ cat foo.bam | samtools view -h -
...

(I realize this is a useless use of cat, but I'm just showing how a BAM file's bytes can be piped to samtools on the command line. These bytes could come from other upstream processes.)

Within a C program, I would like to write chunks of unsigned char bytes to the samtools binary, while simultaneously capturing the standard output from samtools after it processes these bytes.

Because I cannot use popen() to simultaneously write to and read from a process, I looked into using publicly-available implementations of popen2(), which appears to be written to support this.

I wrote the following test code, which attempts to write() 4 kB chunks bytes of a BAM file located in the same directory to a samtools process. It then read()s bytes from the output of samtools into a line buffer, printed to standard error:

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define READ 0
#define WRITE 1

pid_t popen2(const char *command, int *infp, int *outfp)
{
    int p_stdin[2], p_stdout[2];
    pid_t pid;

    if (pipe(p_stdin) != 0 || pipe(p_stdout) != 0)
        return -1;

    pid = fork();

    if (pid < 0)
        return pid;
    else if (pid == 0)
    {
        close(p_stdin[WRITE]);
        dup2(p_stdin[READ], READ);
        close(p_stdout[READ]);
        dup2(p_stdout[WRITE], WRITE);

        execl("/bin/sh", "sh", "-c", command, NULL);
        perror("execl");
        exit(1);
    }

    if (infp == NULL)
        close(p_stdin[WRITE]);
    else
        *infp = p_stdin[WRITE];

    if (outfp == NULL)
        close(p_stdout[READ]);
    else
        *outfp = p_stdout[READ];

    return pid;
}

int main(int argc, char **argv)
{
    int infp, outfp;

    /* set up samtools to read from stdin */
    if (popen2("samtools view -h -", &infp, &outfp) <= 0) {
        printf("Unable to exec samtools\n");
        exit(1);
    }

    const char *fn = "foo.bam";
    FILE *fp = NULL;
    fp = fopen(fn, "r");
    if (!fp)
        exit(-1);
    unsigned char buf[4096];
    char line_buf[65536] = {0};
    while(1) {
        size_t n_bytes = fread(buf, sizeof(buf[0]), sizeof(buf), fp);
        fprintf(stderr, "read\t-> %08zu bytes from fp\n", n_bytes);
        write(infp, buf, n_bytes);
        fprintf(stderr, "wrote\t-> %08zu bytes to samtools process\n", n_bytes);
        read(outfp, line_buf, sizeof(line_buf));
        fprintf(stderr, "output\t-> \n%s\n", line_buf);
        memset(line_buf, '\0', sizeof(line_buf));
        if (feof(fp) || ferror(fp)) {
            break;
        }
    }
    return 0;
}

(For a local copy of foo.bam, here is a link to the binary file I am using for testing. But any BAM file would do for testing purposes.)

To compile:

$ cc -Wall test_bam.c -o test_bam

The problem is that the procedure hangs after the write() call:

$ ./test_bam
read    -> 00004096 bytes from fp
wrote   -> 00004096 bytes to samtools process
[bam_header_read] EOF marker is absent. The input is probably truncated.

If I close() the infp variable immediately after the write() call, then the loop goes through one more iteration before hanging:

...
write(infp, buf, n_bytes);
close(infp); /* <---------- added after the write() call */
fprintf(stderr, "wrote\t-> %08zu bytes to samtools process\n", n_bytes);
...

With the close() statement:

$ ./test_bam
read    -> 00004096 bytes from fp
wrote   -> 00004096 bytes to samtools process
[bam_header_read] EOF marker is absent. The input is probably truncated.
[main_samview] truncated file.
output  -> 
@HD VN:1.0 SO:coordinate
@SQ SN:seq1 LN:5000
@SQ SN:seq2 LN:5000
@CO Example of SAM/BAM file format.

read    -> 00004096 bytes from fp
wrote   -> 00004096 bytes to samtools process

With this change, I get some output that I'd otherwise expect to get if I ran samtools on the command line, but as mentioned, the procedure hangs once again.

How does one use popen2() to write and read data in chunks to internal buffers? If this isn't possible, are there alternatives to popen2() that would work better for this task?

Dijkgraaf
  • 11,049
  • 17
  • 42
  • 54
Alex Reynolds
  • 95,983
  • 54
  • 240
  • 345

2 Answers2

1

As an alternative to a pipe, why not communicate with samtools through a socket? Checking the samtools source, the file knetfile.c indicates that samtools has socket communications available:

#include "knetfile.h"

/* In winsock.h, the type of a socket is SOCKET, which is: "typedef
* u_int SOCKET". An invalid SOCKET is: "(SOCKET)(~0)", or signed
* integer -1. In knetfile.c, I use "int" for socket type
* throughout. This should be improved to avoid confusion.
*
* In Linux/Mac, recv() and read() do almost the same thing. You can see
* in the header file that netread() is simply an alias of read(). In
* Windows, however, they are different and using recv() is mandatory.
*/

That may provide a better option than using pipe2.

David C. Rankin
  • 81,885
  • 6
  • 58
  • 85
  • I think this would require importing samtools headers and libraries into this program. I don't see a way to run this binary on the command-line to put it into some kind of daemon mode that would accept connections via a socket. I would be looking for a way to do this through file streams, so that I can make use of the binary as the dependency. – Alex Reynolds Jun 19 '14 at 23:37
  • Fair enough. I saw the socket availability and thought I would offer that as a solution. – David C. Rankin Jun 19 '14 at 23:40
  • Sorry if it sounds like I'm ungrateful — I'm really not! I do appreciate all suggestions. I just wanted to clarify the problem I'm trying to solve. – Alex Reynolds Jun 19 '14 at 23:41
-2

This problem has nothing to do with the particular implementation of popen2. Also note that on OS X, popen lets you open a bidirectional pipe, this may be true on other BSD systems too. If this is to be portable, you'd need a configure check for whether popen allows bidirectional pipes (or something equivalent to a configure check).

You need to switch the pipes to non-blocking mode, and alternate between read and write calls in an endless loop. Such loop, in order not to waste the CPU when the samtools process is busy, needs to use select, poll or a similar mechanism that blocks for a file descriptor to become "available" (more data to read, or ready to accept data for writing).

See this question for some inspiration.

Community
  • 1
  • 1
Kuba hasn't forgotten Monica
  • 95,931
  • 16
  • 151
  • 313
  • I'm definitely looking for a portable solution. I'm not sure `popen` will work, if the specifications are as described. – Alex Reynolds Jun 19 '14 at 23:33
  • @AlexReynolds The defacto portable solution is to use `popen2` if the platform provides it, or use `popen` if it's bidirectional on a given platform, or use your own `popen2` if none of those work. And then you switch to nonblocking processing and all is good in the universe :) Configure checks are a must, in some incarnation. You can simply characterize all supported platforms and use a static configuration header that detects the platform and activates the correct implementation. That's what a lot of header-only C++ libraries do, and "compile along" C libs like sqlite amalgamation. – Kuba hasn't forgotten Monica Jun 20 '14 at 00:14
  • Would you know how to implement a blocking mechanism that would be relevant to my original question, which uses `popen2()`? It's looks like setting a non-blocking flag on the write descriptor helps, but some data are lost. The question you linked to doesn't offer much help here. I appreciate any details. – Alex Reynolds Jun 20 '14 at 00:22
  • @AlexReynolds "Some data are lost" - huh? How? Why? All you need to do is code up what the second paragraph says. It's a couple dozen lines. It's not that the non-blocking flag "helps", it is impossible to do what you intend without using either non-blocking access or multiple threads. – Kuba hasn't forgotten Monica Jun 20 '14 at 12:05