7

I am using pdfgrep to search for all occurrences of a keyword in a PDF document.

Now I want to do this via PHP so I can use it on my website.

However, when I run:

$output = shell_exec("pdfgrep -i $keyword $file");
$var_dump($output);

Where $keyword is the keyword and $file is the file, I don't get the entire output.

The PDF is made up of a table of product codes, product names, and product prices.

When I execute the command via Terminal, I'm able to see the entire row of data:

product code 1    product name with keyword substring    corresponding price
product code 2    product name with keyword substring    corresponding price
product code 3    product name with keyword substring    corresponding price

However, when I ran it via PHP, I got something like:

name with keyword substring with keyword substring product code 1 
product name with keyword substring product name with keyword substring 
corresponding price

It just does not get all the data. It doesn't always get the product code and the price, and there have been many instances where it doesn't get the entire product name either.

I view the output in a browser and added header('Content-Type: text/plain');, but that only prettifies the output; the data is still incomplete.

I've tried running the exact same shell command via Python 3.6, and that gave me the output I wanted.

Now, I've tried to run the same Python script via PHP but I still get the same broken output.

I've tried searching for a keyword that I know returns a shorter output, but I still don't get the entire data line that I need.

Is there any way to reliably get all the data thrown by the shell_exec() command?

Are there alternatives available, such as a different command, or running a Python script from a server (since the Python script doesn't have any issues anyway)?

Daniel W.
Razgriz
  • How do you execute (or view) the PHP output? From console? From browser? The most versatile function to execute is `proc_open()` - all pipes and stuff easily configurable. – Daniel W. May 08 '19 at 13:43
  • I view it via browser. – Razgriz May 08 '19 at 22:47
  • I have read several times about output being truncated. Maybe because of hidden stop-characters or a buffer being too small, or an internal race condition. I would love to see an answer. [Linked question with the same problem.](https://stackoverflow.com/questions/17052760/output-of-shell-exec-gets-truncated-to-100-characters) – Daniel W. May 09 '19 at 09:35
  • What about storing the output in a file ? `shell_exec("pdfgrep -i $keyword $file > ". __DIR__ . "/output.log");` – DarkBee May 13 '19 at 11:35
  • Perhaps it has to do with encoding, in which case check this [question](https://stackoverflow.com/questions/13961870/php-exec-change-encoding) on SO – Webber May 13 '19 at 11:48
  • Don't you need to run shell_exec() with superuser privileges? –  May 13 '19 at 18:41
  • Try running the same command via command-line PHP. That should be your first approach, rather than the browser. – Dilyan Trayanov May 14 '19 at 13:14
  • Can you upload a sample pdf document along with the expected output? I will have to test things locally. – itsben May 14 '19 at 14:12
  • I have taken a look at this on my machine and cannot see the issue. There must be some special characters in the PDF document. Would you please upload the document for analysis? – itsben May 16 '19 at 20:35
  • Also, there should not be a dollar sign in front of the var_dump command. – itsben May 16 '19 at 20:38
  • A few quick questions: 1. If you run `top` while the `exec` is called, do you see any CPU spikes? 2. How long does the call take to return? – Ben D May 19 '19 at 03:19
  • Have you tried this `$output = exec("pdfgrep -i $keyword $file 2>&1"); var_dump($output);` ?? – Rohit Mittal May 19 '19 at 18:23
  • @Razgriz did anything change regarding your issue? – Daniel W. May 21 '19 at 00:04
  • @DanielW. I decided to go with the Python solution, but I might get back to trying this in a couple days as I'm having problems deploying my Python Flask app on my server. – Razgriz May 21 '19 at 01:02

3 Answers

6

I don't know how pdfgrep works but maybe it mixes stdout and stderr? Either way, you could use a construction like this, where you capture the output stream into an output buffer, optionally also mixing stderr into stdout:

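$mixStdErrIntoStdOut = false;

// system() prints the command's output, so capture it in an output buffer.
ob_start();
$exitCode = 0;
if ($mixStdErrIntoStdOut) {
    // 2>&1 redirects stderr into stdout so both end up in the buffer.
    // $exitCode is passed normally; system() declares it by reference.
    system("pdfgrep -i $keyword $file 2>&1", $exitCode);
} else {
    system("pdfgrep -i $keyword $file", $exitCode);
}
$output = ob_get_clean();

var_dump($output);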
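
Whichever function runs the command, it may be worth ruling out shell quoting first: a keyword or path containing spaces or shell metacharacters is split or interpreted by the shell before pdfgrep ever sees it. Below is a minimal sketch that quotes both values with escapeshellarg(). The helper name buildPdfgrepCommand() is hypothetical, and the quoting shown assumes a Unix-like shell. Whether quoting is the actual cause of the incomplete output is not known from the question.

```php
<?php
/**
 * Build a pdfgrep command line with both arguments shell-quoted.
 * buildPdfgrepCommand() is a hypothetical helper for illustration only.
 */
function buildPdfgrepCommand(string $keyword, string $file): string
{
    // escapeshellarg() wraps the value in single quotes and escapes any
    // embedded single quotes, so the shell passes it on as one argument.
    return 'pdfgrep -i ' . escapeshellarg($keyword) . ' ' . escapeshellarg($file);
}
```

The returned string can then be passed to system(), shell_exec(), or exec() in place of the interpolated command.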
jancha
Kris
3

There are a number of ways to execute the process and gather the output. If you can consistently reproduce the problem, you may try other process execution methods:

1) exec($command, $output)

$output = [];
exec($command, $output);

This should push all of the output, line by line, into your $output array. Initialize the array before the call, because exec() appends to an existing array rather than replacing it.

2) passthru($command)

This passes all of the command's output straight through to PHP's output, so to capture it you need to use an output buffer:

ob_start();
passthru($command);
$contents = ob_get_contents();
ob_end_clean();

3) popen($command, "r");

$output = "";
$handle = popen($command, "r");
if ($handle !== false) {
    // Read until EOF in 4 KB chunks.
    while (!feof($handle)) {
        $output .= fread($handle, 4096);
    }
    pclose($handle);
}

Let me know what you get by calling each of the methods.

Also, make sure to check stderr for errors.
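
One way to check stderr without mixing it into stdout is proc_open(), which was also suggested in the comments. The following is a minimal sketch, not a definitive implementation: the helper name runCommand() is hypothetical, and it assumes a Unix-like system where the command string is run through the shell. Stderr is sent to a temporary file rather than a second pipe so that reading stdout to the end cannot stall on a full stderr pipe.

```php
<?php
/**
 * Run a command and return its stdout, stderr, and exit code separately.
 * runCommand() is a hypothetical helper for illustration only.
 */
function runCommand(string $command): array
{
    // Temporary file that receives the child's stderr.
    $stderrFile = tmpfile();
    if ($stderrFile === false) {
        throw new RuntimeException('Could not create a temporary file');
    }

    $descriptors = [
        1 => ['pipe', 'w'], // child's stdout, read through $pipes[1]
        2 => $stderrFile,   // child's stderr, written to the temp file
    ];

    $process = proc_open($command, $descriptors, $pipes);
    if ($process === false) {
        fclose($stderrFile);
        throw new RuntimeException("Could not start: $command");
    }

    // Read stdout until the child closes it.
    $stdout = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // proc_close() waits for the child and returns its exit code.
    $exitCode = proc_close($process);

    // Read back whatever the child wrote to stderr.
    rewind($stderrFile);
    $stderr = stream_get_contents($stderrFile);
    fclose($stderrFile);

    return ['stdout' => $stdout, 'stderr' => $stderr, 'exitCode' => $exitCode];
}
```

Calling runCommand($command) and dumping the result shows at a glance whether the missing text went to stderr or whether the command exited with an error.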

Pedro Lobito
jancha
2

I had exactly the same problem with a PDF vault of digitized contracts.

The outputs of "exec", "shell_exec" and "passthru" were so erratic that I had to opt for a creative solution: use ssh2_connect and ssh2_exec to connect to the same machine.

Omitting the SSH connection part (https://www.php.net/manual/es/function.ssh2-connect.php), the code is:

// execute the command over the existing SSH connection
$stream     = ssh2_exec($connection, "pdfgrep -i {$keyword} {$file}");
// switch to blocking mode so reads wait for the command's output
stream_set_blocking($stream, true);
// fetch the stdout sub-stream of the command
$stream_out = ssh2_fetch_stream($stream, SSH2_STREAM_STDIO);
// return the complete output
return stream_get_contents($stream_out);

It may be a little complicated at first, but in the long term there are more benefits:

  1. Each execution is limited to the permissions of the SSH user, not those of www-data.
  2. If you want to build an "extract cache" of PDF content, the new files will be created by the correct user.
  3. If the PDF vault grows so much that it must be moved out to a CDN, with this method it is enough to connect to the host that is to be searched.
  4. And most important, you will get the complete stream_out.

See ya!

Benjamin