I am trying to integrate use of samtools into a C program. This application reads data in a binary format called BAM, e.g. from stdin:
$ cat foo.bam | samtools view -h -
...
(I realize this is a useless use of cat, but I'm just showing how a BAM file's bytes can be piped to samtools on the command line. These bytes could come from other upstream processes.)
Within a C program, I would like to write chunks of unsigned char bytes to the samtools binary, while simultaneously capturing the standard output from samtools after it processes these bytes.
Because I cannot use popen() to simultaneously write to and read from a process, I looked into using publicly available implementations of popen2(), which appear to be written to support this.
I wrote the following test code, which attempts to write() 4 kB chunks of a BAM file located in the same directory to a samtools process. It then read()s bytes from the output of samtools into a line buffer, which is printed to standard error:
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define READ 0
#define WRITE 1

pid_t popen2(const char *command, int *infp, int *outfp)
{
    int p_stdin[2], p_stdout[2];
    pid_t pid;

    if (pipe(p_stdin) != 0 || pipe(p_stdout) != 0)
        return -1;

    pid = fork();

    if (pid < 0)
        return pid;
    else if (pid == 0)
    {
        close(p_stdin[WRITE]);
        dup2(p_stdin[READ], READ);
        close(p_stdout[READ]);
        dup2(p_stdout[WRITE], WRITE);

        execl("/bin/sh", "sh", "-c", command, NULL);
        perror("execl");
        exit(1);
    }

    if (infp == NULL)
        close(p_stdin[WRITE]);
    else
        *infp = p_stdin[WRITE];

    if (outfp == NULL)
        close(p_stdout[READ]);
    else
        *outfp = p_stdout[READ];

    return pid;
}

int main(int argc, char **argv)
{
    int infp, outfp;

    /* set up samtools to read from stdin */
    if (popen2("samtools view -h -", &infp, &outfp) <= 0) {
        printf("Unable to exec samtools\n");
        exit(1);
    }

    const char *fn = "foo.bam";
    FILE *fp = NULL;
    fp = fopen(fn, "r");
    if (!fp)
        exit(-1);

    unsigned char buf[4096];
    char line_buf[65536] = {0};
    while(1) {
        size_t n_bytes = fread(buf, sizeof(buf[0]), sizeof(buf), fp);
        fprintf(stderr, "read\t-> %08zu bytes from fp\n", n_bytes);
        write(infp, buf, n_bytes);
        fprintf(stderr, "wrote\t-> %08zu bytes to samtools process\n", n_bytes);
        read(outfp, line_buf, sizeof(line_buf));
        fprintf(stderr, "output\t-> \n%s\n", line_buf);
        memset(line_buf, '\0', sizeof(line_buf));
        if (feof(fp) || ferror(fp)) {
            break;
        }
    }

    return 0;
}
(For a local copy of foo.bam, here is a link to the binary file I am using for testing. But any BAM file would do for testing purposes.)
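A side note on the test code above: write() and read() both return the number of bytes actually transferred, and the loop ignores both. On a pipe, a blocking write() can return after writing only part of the buffer (for instance when interrupted by a signal), so the bytes reaching samtools may not match the bytes read from foo.bam. Below is a minimal sketch of a wrapper that retries until everything is written. The name write_all is made up for illustration and is not part of any library:

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all `len` bytes are sent.
 * Returns 0 on success, -1 on error (errno is left as set by write()). */
int write_all(int fd, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            return -1;              /* real error, e.g. EPIPE or EBADF */
        }
        p += n;                     /* skip past the bytes already written */
        len -= (size_t)n;
    }
    return 0;
}
```

In the loop above, write(infp, buf, n_bytes) would become write_all(infp, buf, n_bytes) with a check of the result. The read() call needs the same care: line_buf only holds valid data up to the count that read() returns, and a full 65536-byte read leaves no terminating NUL for "%s".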
To compile:
$ cc -Wall test_bam.c -o test_bam
The problem is that the procedure hangs after the write() call:
$ ./test_bam
read -> 00004096 bytes from fp
wrote -> 00004096 bytes to samtools process
[bam_header_read] EOF marker is absent. The input is probably truncated.
If I close() the infp variable immediately after the write() call, then the loop goes through one more iteration before hanging:
...
write(infp, buf, n_bytes);
close(infp); /* <---------- added after the write() call */
fprintf(stderr, "wrote\t-> %08zu bytes to samtools process\n", n_bytes);
...
With the close() statement:
$ ./test_bam
read -> 00004096 bytes from fp
wrote -> 00004096 bytes to samtools process
[bam_header_read] EOF marker is absent. The input is probably truncated.
[main_samview] truncated file.
output ->
@HD VN:1.0 SO:coordinate
@SQ SN:seq1 LN:5000
@SQ SN:seq2 LN:5000
@CO Example of SAM/BAM file format.
read -> 00004096 bytes from fp
wrote -> 00004096 bytes to samtools process
With this change, I get some of the output that I'd expect to get if I ran samtools on the command line, but as mentioned, the procedure hangs once again.
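One likely contributor to this second hang, independent of samtools: the popen2() listed above never closes p_stdin[READ] and p_stdout[WRITE] in the parent. As long as the parent itself holds a write end of the stdout pipe, read(outfp, ...) cannot report end of file, even after the child has exited. Below is a minimal sketch of a variant in which each process closes the ends it does not use. The name popen2_closing is made up for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical variant of popen2(): same interface, but the parent and the
 * child each close the pipe ends they do not need. */
pid_t popen2_closing(const char *command, int *infp, int *outfp)
{
    int p_stdin[2], p_stdout[2];

    if (pipe(p_stdin) != 0)
        return -1;
    if (pipe(p_stdout) != 0) {
        close(p_stdin[0]);
        close(p_stdin[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(p_stdin[0]);  close(p_stdin[1]);
        close(p_stdout[0]); close(p_stdout[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: wire the pipes to stdin/stdout, then drop the originals. */
        dup2(p_stdin[0], STDIN_FILENO);
        dup2(p_stdout[1], STDOUT_FILENO);
        close(p_stdin[0]);  close(p_stdin[1]);
        close(p_stdout[0]); close(p_stdout[1]);
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        perror("execl");
        _exit(127);
    }

    /* Parent: the child-side ends must be closed here, otherwise
     * read() on *outfp never sees end of file. */
    close(p_stdin[0]);
    close(p_stdout[1]);

    if (infp == NULL)
        close(p_stdin[1]);
    else
        *infp = p_stdin[1];

    if (outfp == NULL)
        close(p_stdout[0]);
    else
        *outfp = p_stdout[0];

    return pid;
}
```

With this variant, closing infp in the parent lets the child see end of input, and once the child exits, read() on outfp returns 0 instead of blocking. The caller would still want to waitpid() on the returned pid to reap the child.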
How does one use popen2() to write and read data in chunks to internal buffers? If this isn't possible, are there alternatives to popen2() that would work better for this task?
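On the question of alternatives, one commonly used pattern keeps the two pipes but multiplexes them with poll(), so the program writes only when the child can accept input and reads only when output is available. A loop that does one write() followed by one blocking read() can deadlock if the child buffers its output until it has consumed more input or reached end of file (an assumption about how samtools behaves here, not something verified). Below is a minimal sketch under those assumptions. The name pump_duplex is made up for illustration:

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: feed all of `src` to the child's stdin (to_child)
 * while draining the child's stdout (from_child) into `dst`.
 * Closes both descriptors. Returns 0 on success, -1 on error. */
int pump_duplex(FILE *src, int to_child, int from_child, FILE *dst)
{
    unsigned char in_buf[4096];
    char out_buf[65536];
    size_t in_len = 0, in_off = 0;
    int writing = 1, reading = 1, rc = 0;

    /* Non-blocking writes: a full pipe yields EAGAIN instead of a hang. */
    int flags = fcntl(to_child, F_GETFL);
    if (flags < 0 || fcntl(to_child, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(to_child);
        close(from_child);
        return -1;
    }

    while (reading) {
        if (writing && in_off == in_len) {      /* refill the input chunk */
            in_len = fread(in_buf, 1, sizeof in_buf, src);
            in_off = 0;
            if (in_len == 0) {                  /* EOF or error on src */
                close(to_child);                /* child now sees EOF */
                writing = 0;
            }
        }

        /* A negative fd tells poll() to ignore that entry. */
        struct pollfd fds[2] = {
            { .fd = writing ? to_child : -1, .events = POLLOUT },
            { .fd = from_child,              .events = POLLIN  },
        };
        if (poll(fds, 2, -1) < 0) {
            if (errno == EINTR)
                continue;
            rc = -1;
            break;
        }

        if (fds[1].revents) {                   /* data, hang-up, or error */
            ssize_t n = read(from_child, out_buf, sizeof out_buf);
            if (n > 0) {
                fwrite(out_buf, 1, (size_t)n, dst);
            } else if (n == 0) {
                reading = 0;                    /* child closed its stdout */
            } else if (errno != EINTR) {
                rc = -1;
                break;
            }
        }

        if (writing && (fds[0].revents & (POLLERR | POLLHUP | POLLNVAL))) {
            close(to_child);                    /* child stopped reading */
            writing = 0;
        } else if (writing && (fds[0].revents & POLLOUT)) {
            ssize_t n = write(to_child, in_buf + in_off, in_len - in_off);
            if (n > 0) {
                in_off += (size_t)n;            /* partial writes are fine */
            } else if (n < 0 && errno != EAGAIN && errno != EINTR) {
                close(to_child);
                writing = 0;
            }
        }
    }

    if (writing)
        close(to_child);
    close(from_child);
    return rc;
}
```

With the descriptors from popen2(), a call would look like pump_duplex(fp, infp, outfp, stderr). This assumes the parent holds no other copy of the write end of the child's stdout pipe. The popen2() listed above keeps p_stdout[WRITE] open in the parent, so it would need to close it for read() to ever report end of file. A program using this would probably also want to ignore SIGPIPE, in case the child exits while input is still being written.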