13

I'm trying to read metadata attached to arbitrary PDFs: title, author, subject, and keywords.

Is there a PHP library, preferably open-source, that can read PDF metadata? If so, or if there isn't, how would one use the library (or lack thereof) to extract the metadata?

To be clear, I'm not interested in creating or modifying PDFs or their metadata, and I don't care about the PDF bodies. I've looked at a number of libraries, including FPDF (which everyone seems to recommend), but it appears only to be for PDF creation, not metadata extraction.

  • @ircmaxell I apologize for not making it clearer that I'm really looking for a workable solution. Do you have an example of how one could extract the metadata, library or otherwise? –  Dec 20 '10 at 19:42
  • I know, which is why I commented. I don't have or know of any tools to do it, I was just commenting that if all else fails writing your own shouldn't be too hard... – ircmaxell Dec 20 '10 at 19:43

5 Answers

12

PDF Parser does exactly what you want and it's pretty straightforward to use:

// Requires the smalot/pdfparser package.
$parser  = new \Smalot\PdfParser\Parser();
$pdf     = $parser->parseFile('document.pdf');
$details = $pdf->getDetails(); // array of document metadata

You can try it on the demo page.

Alessandro Cosentino
9

The Zend framework includes Zend_Pdf, which makes this really easy:

$pdf = Zend_Pdf::load($pdfPath);

echo $pdf->properties['Title'] . "\n";
echo $pdf->properties['Author'] . "\n";

Limitations: Works only on unencrypted files smaller than 16MB.

Community
  • There's also a bunch of PDFs that Zend_Pdf dies a horrible death on. PDFs saved in old PDF versions (like 1.4) are usually safe. – chrishiestand Nov 21 '13 at 09:25
  • It seems that as of 5/27/2019 this library no longer exists – Chucky May 28 '19 at 05:48
  • Zend Framework 1 is available as a fork that now runs in PHP 7 and 8. See: https://github.com/Shardj/zf1-future Zend PDF is here: https://github.com/Shardj/zf1-future/tree/master/library/Zend/Pdf – WebTigers Jul 07 '22 at 10:02
  • The Zend PDF docs can be found here: https://framework.zend.com/manual/1.12/en/zend.pdf.html – WebTigers Jul 07 '22 at 10:22
3
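<?php

    $sourcefile = "file path";
    // Reads the whole file into memory, so this is unsuitable for very large PDFs.
    $stringedPDF = file_get_contents($sourcefile);

    // Looks for "Title " followed by a (...) or [...] value in the raw bytes.
    // The match keeps the surrounding delimiters, e.g. "(My Title)".
    if (preg_match('/(?<=Title )\S(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))./', $stringedPDF, $title)) {
        echo $title[0];
    }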
joan16v
  • This solution requires reading the full PDF into memory. I have to get the title from some 800MB PDFs, for example – Einacio Nov 08 '17 at 19:33
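The memory concern raised in the comment above can be eased by scanning the file in chunks instead of loading it whole. The following is only a rough sketch: the function name `findPdfTitle`, the 64 KB chunk size, and the 1024-byte overlap are illustrative assumptions. It also assumes the title is stored as a plain `/Title (...)` literal string in an uncompressed, unencrypted part of the file, and it does not handle hex strings, UTF-16 titles, or unescaped nested parentheses.

```php
<?php
// Hypothetical helper: scans a file chunk by chunk for a "/Title (...)" entry.
function findPdfTitle(string $path, int $chunkSize = 65536): ?string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }

    $carry = '';   // tail of the previous buffer, kept so a match can span two chunks
    $title = null;

    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }

        $buffer = $carry . $chunk;

        // Literal string after /Title; backslash-escaped characters are allowed inside.
        if (preg_match('/\/Title\s*\(((?:\\\\.|[^\\\\)])*)\)/s', $buffer, $m)) {
            $title = $m[1];
            break;
        }

        // Assumption: a title entry is shorter than 1024 bytes.
        $carry = substr($buffer, -1024);
    }

    fclose($handle);
    return $title;
}
```

With this approach at most one chunk plus the overlap is held in memory at a time, regardless of the file size.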
1

I was looking for the same thing today and came across a small PHP class at http://de77.com/ that offers a quick-and-dirty solution. You can download the class directly. Output is UTF-8 encoded.

The creator says:

Here’s a PHP class I wrote which can be used to get title & author and a number of pages of any PDF file. It does not use any external application - just pure PHP.

// basic example
include 'PDFInfo.php';
$p = new PDFInfo;
$p->load('file.pdf');
echo $p->author;
echo $p->title;
echo $p->pages;

It works for me! All thanks go solely to the creator of the class ... well, maybe a little thanks to me too for finding it ;)

maxpower9000
  • That class is less efficient and doesn't read much of the PDF metadata. Using pdfinfo on Linux you can extract metadata that PDFInfo doesn't, so I think another library would do better. – Néstor Apr 21 '14 at 17:11
  • Unfortunately, the link is now down, but it looked to be the easiest solution (especially since I only need the title...). @Néstor : why do you say it is "less efficient"? – brclz Jan 01 '17 at 23:17
  • @brclz this post is old and has already been answered, and as far as I remember it didn't work for what I was doing. The link provided is dead because the author has killed it. – Néstor Jan 05 '17 at 18:22
  • The class loses because it uses a hard-coded `dc:` xmlns prefix for DublinCore. XML specs state the prefix is arbitrary, e.g. an element written with the `dc:` prefix is *exactly* equivalent to the same element written with any other prefix bound to the same namespace. – amphetamachine Dec 05 '21 at 21:07
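The namespace point in the last comment can be illustrated with PHP's DOM extension, which matches elements by namespace URI rather than by the prefix used in the file. This is only a sketch: the XMP packet below is a made-up sample that deliberately uses the prefix `foo` for Dublin Core, and extracting the XMP packet from a real PDF is not shown.

```php
<?php
// Made-up XMP sample: the Dublin Core namespace is bound to "foo", not "dc".
$xmp = <<<'XML'
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:foo="http://purl.org/dc/elements/1.1/">
      <foo:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></foo:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
XML;

$doc = new DOMDocument();
$doc->loadXML($xmp);

$xpath = new DOMXPath($doc);
// The prefixes registered here are local to the XPath query; only the URIs must match.
$xpath->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');
$xpath->registerNamespace('rdf', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

$nodes  = $xpath->query('//dc:creator//rdf:li');
$author = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

echo $author, "\n";
```

Because the query is resolved against the namespace URI, the same code finds the creator whether the document uses `dc:`, `foo:`, or any other prefix.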
1

You may use PDFtk to extract the page count:

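// Escape the file path before interpolating it into a shell command.
$arg = escapeshellarg($path);

// Windows (assumes grep and sed are installed and on the PATH)
$bin = realpath('C:\\pdftk\\bin\\pdftk.exe');
$cmd = "cmd /c {$bin} {$arg} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";

// Unix
$cmd = "pdftk {$arg} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";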

If ImageMagick is available you may also use:

$cmd = "identify -format %n {$path}";

Execute in PHP via shell_exec():

$res = shell_exec($cmd);
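The same `dump_data` output can also be used for the title and author asked about in the question. The sketch below assumes the output lists metadata as `InfoKey:` / `InfoValue:` line pairs; that format, the helper name `parsePdftkInfo`, and the sample text are assumptions to check against your pdftk version.

```php
<?php
// Hypothetical helper: turns pdftk dump_data text into a key => value array.
function parsePdftkInfo(string $dump): array
{
    $info = [];
    $key  = null;

    foreach (preg_split('/\R/', $dump) as $line) {
        if (strpos($line, 'InfoKey: ') === 0) {
            // Remember the key until its value line arrives.
            $key = substr($line, strlen('InfoKey: '));
        } elseif ($key !== null && strpos($line, 'InfoValue: ') === 0) {
            $info[$key] = substr($line, strlen('InfoValue: '));
            $key = null;
        }
    }

    return $info;
}

// Sample text in the assumed dump_data format (not real pdftk output).
$sample = "InfoBegin\nInfoKey: Title\nInfoValue: Annual Report\n"
        . "InfoBegin\nInfoKey: Author\nInfoValue: Jane Doe\n"
        . "NumberOfPages: 12\n";

$info = parsePdftkInfo($sample);
echo $info['Title'] ?? '(no title)', "\n";
```

In practice the string passed to the helper would be the result of shell_exec() on a `pdftk ... dump_data` command without the grep and sed filters.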
maxpower9000