
Does anyone know a simple way to read/extract the keywords from a .pdf file? The file is not password protected, and it was generated on the same server using the FPDF class.

I know there are some powerful (but not free) tools for manipulating .pdf files that provide a simple way to get all the metadata out.

I also know that a .pdf file stores its metadata in a dictionary delimited by << and >>, with a / before each entry name to identify it. What I need is the string that follows "/Keywords", stored in a variable.

Any ideas on how to parse the file and get only that string?

(Currently I'm writing a JSON string into the keywords, so it looks like this: ([{"FirstName":"7bis","LastName":"lastName","email":"email@email.com"}]) )

Opened in a text editor, the .pdf file looks like this:

/F1 6 0 R
>>
/XObject <<
>>
>>
endobj
7 0 obj
<<
/Keywords ([{"FirstName":"7bis","LastName":"lastName","email":"email@email.com"}])
/Producer (FPDF 1.81)
/CreationDate (D:20160531084015)
>>
endobj

Thanks for any suggestions ;)

Andrea

2 Answers


Finally, after some coding and some reading about parsing in general, I found a way to extract what I need. I read the .pdf file into a string, then parse that string and extract the content after Keywords:

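$file = "/directory/of/file/example.pdf";
// Read the whole PDF into a string; here the info dictionary is visible as plain text.
$stringedPDF = file_get_contents($file);
// Capture the text between the parentheses after /Keywords (spaces and escaped parentheses allowed).
if ($stringedPDF !== false && preg_match('~/Keywords\s*\(((?:\\\\.|[^\\\\()])*)\)~s', $stringedPDF, $match)) {
    return $match[1];
}
return null;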
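
Since the keywords hold a JSON string, the extracted value still has to be decoded before use. A minimal sketch of that step, assuming the value is a PDF literal string with only simple backslash escapes (octal escapes and UTF-16 strings are not handled); `unescapePdfString` and `keywordsToArray` are hypothetical helper names, not part of FPDF:

```php
<?php
// Hypothetical helper: undo the simple backslash escapes of a PDF literal
// string (\\, \(, \), \n, \r, \t). Octal escapes are NOT handled here.
function unescapePdfString(string $s): string
{
    return preg_replace_callback('/\\\\(.)/s', function (array $m): string {
        $map = ['n' => "\n", 'r' => "\r", 't' => "\t"];
        return $map[$m[1]] ?? $m[1];
    }, $s);
}

// Hypothetical helper: turn the raw /Keywords value into a PHP array,
// or return null when it is not valid JSON.
function keywordsToArray(string $raw): ?array
{
    // Drop the surrounding parentheses if the capture still includes them.
    if (strlen($raw) >= 2 && $raw[0] === '(' && substr($raw, -1) === ')') {
        $raw = substr($raw, 1, -1);
    }
    $data = json_decode(unescapePdfString($raw), true);
    return is_array($data) ? $data : null;
}

// The sample value from the question.
$raw = '([{"FirstName":"7bis","LastName":"lastName","email":"email@email.com"}])';
$people = keywordsToArray($raw);
echo $people[0]['email'], PHP_EOL; // prints email@email.com
```

The helper accepts the value with or without the outer parentheses, so it works whichever way the regex captured it.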

I'm pretty sure this can be tuned, because in these files the metadata is always near the end of the file. It would be better to read only the last part of the file instead of loading the whole file into a string; that would save a lot of time, especially with large .pdf files.
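
The tuning described above could look roughly like this. It is only a sketch: the function name `extractKeywordsFromTail` and the 4096-byte default are assumptions, and it relies on the info dictionary being uncompressed and close to the end of the file, which may not hold for PDFs produced by other tools. The pattern captures the text between the parentheses after /Keywords.

```php
<?php
// Hypothetical helper: search only the last $tailBytes bytes of the file
// for the /Keywords entry instead of loading the whole PDF.
function extractKeywordsFromTail(string $file, int $tailBytes = 4096): ?string
{
    if (!is_readable($file)) {
        return null;
    }
    $handle = fopen($file, 'rb');
    if ($handle === false) {
        return null;
    }
    // Jump to the start of the tail (or to offset 0 if the file is short).
    $size = fstat($handle)['size'];
    fseek($handle, max(0, $size - $tailBytes));
    $tail = stream_get_contents($handle);
    fclose($handle);

    // Capture everything between the parentheses that follow /Keywords.
    $pattern = '~/Keywords\s*\(((?:\\\\.|[^\\\\()])*)\)~s';
    if ($tail !== false && preg_match($pattern, $tail, $match)) {
        return $match[1];
    }
    return null;
}

// Usage with a throwaway file that imitates the end of an FPDF document.
$tmp = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents(
    $tmp,
    str_repeat('x', 10000) . "7 0 obj\n<<\n/Keywords ([{\"a\":1}])\n/Producer (FPDF 1.81)\n>>\nendobj\n"
);
echo extractKeywordsFromTail($tmp), PHP_EOL; // prints [{"a":1}]
unlink($tmp);
```

If the function returns null, falling back to a search over the whole file keeps the behaviour of the original approach.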

Andrea

You could try the code below, taken from source:

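$parser  = new \Smalot\PdfParser\Parser();
$pdf     = $parser->parseFile('document.pdf');
// getDetails() returns the document metadata as an array.
$details = $pdf->getDetails();
// Present only if the PDF's info dictionary has a /Keywords entry.
$keywords = $details['Keywords'] ?? null;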
Ankit Doshi
  • Actually, I would like to avoid any class that requires Composer, like the one you suggested. I tried the online demo and it works as I need, but I can't use Composer at the moment, and it seems there is no way to "include" this class without it. – Andrea May 31 '16 at 12:16