
Is it possible to call CLI tools like pdftotext, antiword, catdoc (text extractor scripts) passing a string instead of a file?

Currently, I read PDF files by calling pdftotext with child_process.spawn. I spawn a new process and store the result in a new variable. Everything works fine.

I’d like to pass the binary data returned by fs.readFile instead of the file itself:

fs.readFile('./my.pdf', (error, binary) => {
    // Call pdftotext with child_process.spawn passing the binary.
    let event = child_process.spawn('pdftotext', [
        // Args here!
    ]);
});

How can I do that?

Palec

1 Answer


It's definitely possible, if the command can handle piped input.

spawn returns a ChildProcess object, and you can pass an in-memory string (or binary data) to the command by writing to its stdin. One way is to wrap the data in a Readable stream and pipe that stream into the CLI's stdin.

fs.createReadStream creates a Readable stream from a file.
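If the data is already in memory as a Buffer or string (as in the question's fs.readFile callback), a stream wrapper is not strictly needed, because the child's stdin is itself a Writable stream. The following is a minimal sketch under that assumption; `runWithInput` is a hypothetical helper name, not part of Node.js or of any of the tools mentioned.

```javascript
const { spawn } = require('child_process')

// Hypothetical helper: run a command, feed `input` (Buffer or string)
// to its stdin, and resolve with everything it wrote to stdout as a Buffer.
function runWithInput(command, args, input) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args)
    const chunks = []

    child.on('error', reject)        // e.g. the command could not be started
    child.stdin.on('error', reject)  // e.g. EPIPE if the child exits early
    child.stdout.on('data', chunk => chunks.push(chunk))
    child.on('close', code => {
      if (code === 0) resolve(Buffer.concat(chunks))
      else reject(new Error(`${command} exited with code ${code}`))
    })

    // end() writes the data and then signals EOF to the child.
    child.stdin.end(input)
  })
}
```

Inside the question's fs.readFile callback this could be called as runWithInput('pdftotext', ['-', '-'], binary), assuming the installed pdftotext accepts '-' for both input and output.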

The following example downloads a PDF file, pipes the content to pdftotext, and then shows the first 77 characters of the result.

const source = 'http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf'
const http = require('http')
const spawn = require('child_process').spawn

download(source).then(pdftotext)
.then(result => console.log(result.slice(0, 77)))

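function download(url) {
  // Resolves with the response stream; rejects if the request fails.
  return new Promise((resolve, reject) =>
    http.get(url, resolve).on('error', reject))
}

function pdftotext(binaryStream) {
  // '-' '-' tells pdftotext to read the PDF from stdin and write text to stdout.
  const command = spawn('pdftotext', ['-', '-'])
  binaryStream.pipe(command.stdin)

  return new Promise((resolve, reject) => {
    const chunks = []
    command.on('error', reject)
    command.stdout.on('data', chunk => chunks.push(chunk))
    // Decode once at the end so characters split across chunks stay intact.
    command.stdout.on('end', () => resolve(Buffer.concat(chunks).toString()))
  })
}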

For CLIs that have no option to read from stdin, you can use named pipes.

Edit: Added another example with named pipes.

Once the named pipes are created, you can use them like files. The following example creates temporary named pipes to send the input and receive the output, then shows the first 77 characters of the result.

const fs = require('fs')
const spawn = require('child_process').spawn

pipeCommand({
  name: 'wvText',
  input: fs.createReadStream('document.doc'),
}).then(result => console.log(result.slice(0, 77)))

function createPipe(name) {
  return new Promise(resolve =>
    spawn('mkfifo', [name]).on('exit', () => resolve()))
}

function pipeCommand({name, input}) {
  const inpipe = 'input.pipe'
  const outpipe = 'output.pipe'
  return Promise.all([inpipe, outpipe].map(createPipe)).then(() => {
    const result = []
    fs.createReadStream(outpipe)
    .on('data', chunk => result.push(chunk.toString()))
    .on('error', console.log)

    const command = spawn(name, [inpipe, outpipe]).on('error', console.log)
    input.pipe(fs.createWriteStream(inpipe).on('error', console.log))
    return new Promise(resolve =>
      command.on('exit', () => {
        // fs.unlink needs a callback; newer Node.js versions throw without one.
        [inpipe, outpipe].forEach(name => fs.unlink(name, () => {}))
        resolve(result.join(''))
      }))
  })
}
DarkKnight
  • Hey @DarkKnight, thanks a lot!! If I'm not asking too much, could you provide a working example with named pipes? It turns out that I'm using other scripts that don't support the other method. –  Aug 23 '16 at 15:13
  • All of the tools you mentioned can accept `stdin` by specifying `-`. I added another example, anyway. – DarkKnight Aug 24 '16 at 11:56
  • Hey DarkKnight, somehow I'm now seeing ```events.js:160 throw er; // Unhandled 'error' event ^ Error: EPIPE: broken pipe, write at Error (native)```... do you know what could cause this? –  Aug 25 '16 at 04:51
  • Most likely the command exits before consuming all of the input data (and EOF). To inspect it further, you can add `.on('error' ...` to the streams and the child process. – DarkKnight Aug 25 '16 at 11:38
  • It's returning ```{ Error: EPIPE: broken pipe, write at Error (native) errno: -32, code: 'EPIPE', syscall: 'write' }```... is it working on your machine? –  Aug 25 '16 at 14:14
  • All of the examples work (node v6.4.0 on Linux). What command and input data did you use? Does it work with a regular input file? – DarkKnight Aug 25 '16 at 14:41
  • It works with a regular input file. I'm using the antiword script with your script. I'm also trying to do the following: http://codepen.io/anon/pen/EyBLWa?editors=0110 (which is an example of what I will do in my app) –  Aug 25 '16 at 14:47
  • If I try to read a "real" file, everything works just fine. –  Aug 25 '16 at 14:48
  • `cat document.doc > input.pipe & antiword input.pipe`, `antiword` says: I can't get the size of 'input.pipe'. It tries to seek, but pipes are non-seekable. – DarkKnight Aug 25 '16 at 15:18
  • So, it's an antiword problem. Is it fixable? –  Aug 25 '16 at 15:24
  • There is something really strange.. I can't execute antiword, but I can read the file with fs.readFile? lol.. :X –  Aug 25 '16 at 15:44
  • This is really not that strange, @FXAMN. Pipes are not seekable, while ordinary files are. When reading a PDF file, you often need to seek, unless you are OK with reading the whole file into memory (in which case you may have a hard time processing large PDFs). Many tools accept input from a pipe, but they write it immediately into a file and only then process it. – Palec Aug 29 '16 at 15:23
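
As the comments above note, a tool that needs to seek (antiword in this thread) cannot read from a pipe, named or otherwise. One possible workaround, sketched here under the assumption that the tool takes the input path as its last argument and prints to stdout, is to write the in-memory data to a temporary regular file first. `runWithTempFile` and the 'extract-' directory prefix are hypothetical names chosen for illustration.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')
const { execFile } = require('child_process')

// Hypothetical helper: write `buffer` to a temporary regular (seekable)
// file, run `command` with that path appended to `args`, and resolve
// with the command's stdout as a string.
function runWithTempFile(command, args, buffer) {
  return new Promise((resolve, reject) => {
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'extract-'))
    const file = path.join(dir, 'input')
    fs.writeFileSync(file, buffer)

    execFile(command, [...args, file], (error, stdout) => {
      // Clean up whether or not the command succeeded.
      fs.unlinkSync(file)
      fs.rmdirSync(dir)
      if (error) reject(error)
      else resolve(stdout)
    })
  })
}
```

For the case discussed above, this might be called as runWithTempFile('antiword', [], binary). Note that execFile buffers the whole output in memory and enforces a maxBuffer limit, so very large documents may need spawn and streaming instead.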