Reading PDF metadata in PHP

Question

I'm trying to read metadata attached to arbitrary PDFs: title, author, subject, and keywords.

Is there a PHP library, preferably open-source, that can read PDF metadata? If so, or if there isn't, how would one use the library (or lack thereof) to extract the metadata?

To be clear, I'm not interested in creating or modifying PDFs or their metadata, and I don't care about the PDF bodies. I've looked at a number of libraries, including FPDF (which everyone seems to recommend), but it appears only to be for PDF creation, not metadata extraction.

@ircmaxell I apologize for not making it clearer that I'm really looking for a workable solution. Do you have an example of how one could extract the metadata, library or otherwise? — , Dec 20 '10 at 19:42
I know, which is why I commented. I don't have or know of any tools to do it, I was just commenting that if all else fails writing your own shouldn't be too hard... — ircmaxell, Dec 20 '10 at 19:43

score 12 · Answer 1 · answered Mar 27 '14 at 22:41

PDF Parser does exactly what you want and it's pretty straightforward to use:

$parser  = new \Smalot\PdfParser\Parser();
$pdf     = $parser->parseFile('document.pdf');
// getDetails() returns the document metadata as an associative array.
$details = $pdf->getDetails();

You can try it on the demo page.
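
As a follow-up sketch, the four fields from the question can be picked out of the array that getDetails() returns. The helper name pickPdfFields is hypothetical, and the key names (Title, Author, Subject, Keywords) are an assumption based on the standard PDF Info dictionary entries; a file that does not set a field will not have that key. Array values are assumed to be flat lists of strings.

```php
<?php
// Hypothetical helper: keep only the fields the question asks about.
// Assumes $details is the array returned by $pdf->getDetails().
function pickPdfFields(array $details): array
{
    $wanted = ['Title', 'Author', 'Subject', 'Keywords'];
    $fields = [];
    foreach ($wanted as $key) {
        // Missing entries become null instead of raising a notice.
        $value = $details[$key] ?? null;
        // Assumption: an array value is a flat list, so join it for display.
        $fields[$key] = is_array($value) ? implode(', ', $value) : $value;
    }
    return $fields;
}
```

Calling pickPdfFields($pdf->getDetails()) would then give an array with exactly those four keys, with null for anything the PDF does not define.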

answered Mar 27 '14 at 22:41

Alessandro Cosentino

sounds better but demo page fails once you upload a pdf – Néstor Apr 21 '14 at 17:12
@Néstor It used to work when I posted the answer. I contacted the developer about the issue. – Alessandro Cosentino Apr 21 '14 at 19:45

score 9 · Accepted Answer · edited Aug 15 '11 at 11:02

The Zend framework includes Zend_Pdf, which makes this really easy:

$pdf = Zend_Pdf::load($pdfPath);

echo $pdf->properties['Title'] . "\n";
echo $pdf->properties['Author'] . "\n";

Limitations: works only on unencrypted files smaller than 16 MB.

edited Aug 15 '11 at 11:02

Community

answered Dec 23 '10 at 16:44

There's also a bunch of PDFs that Zend_Pdf dies a horrible death on. PDFs saved in old PDF versions (like 1.4) are usually safe. – chrishiestand Nov 21 '13 at 09:25
it seems as of 5/27/2019 this library no longer exists – Chucky May 28 '19 at 05:48
Zend Framework 1 is available as a fork that now runs in PHP 7 and 8. See: https://github.com/Shardj/zf1-future Zend PDF is here: https://github.com/Shardj/zf1-future/tree/master/library/Zend/Pdf – WebTigers Jul 07 '22 at 10:02
The Zend PDF docs can be found here: https://framework.zend.com/manual/1.12/en/zend.pdf.html – WebTigers Jul 07 '22 at 10:22

score 3 · Answer 3 · edited Aug 03 '17 at 12:30

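<?php

    // Path to the PDF to inspect (placeholder).
    $sourcefile = "file path";
    // Read the whole file into memory as a raw string.
    $stringedPDF = file_get_contents($sourcefile);

    // Match the delimited value that follows "Title ", e.g. (My Title) or [My Title].
    // The delimiters are part of the match. Only echo when something was found.
    if (preg_match('/(?<=Title )\S(?:(?<=\().+?(?=\))|(?<=\[).+?(?=\]))./', $stringedPDF, $title)) {
        echo $title[0];
    }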

edited Aug 03 '17 at 12:30

joan16v

answered Aug 03 '17 at 08:26

ved uniyalas

This solution requires reading the full PDF into memory. I have to get the title from some 800 MB PDFs, for example. – Einacio Nov 08 '17 at 19:33
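
One way around the memory concern is to scan the file in fixed-size chunks. The sketch below is only that: the function name findPdfTitleStreaming, the chunk size, and the overlap size are all assumptions. It also assumes the title is stored uncompressed as a literal string of the form /Title (...), without nested parentheses and shorter than the overlap. The first match could be an outline (bookmark) title rather than the document title, and titles stored in compressed object streams or as UTF-16 strings will not be found or will come back undecoded.

```php
<?php
// Sketch: scan a PDF chunk by chunk so the whole file is never held in memory.
// Assumes an uncompressed literal string entry such as: /Title (Annual Report)
function findPdfTitleStreaming(string $path, int $chunkSize = 1048576): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null;
    }
    $overlap = 4096; // bytes carried over so a match split across two chunks is still seen
    $tail = '';
    $found = null;
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $buffer = $tail . $chunk;
        // Capture the text between the parentheses, allowing backslash escapes.
        if (preg_match('/\/Title\s*\(((?:\\\\.|[^\\\\()])*)\)/s', $buffer, $m)) {
            $found = $m[1];
            break;
        }
        // Keep the end of this buffer for the next round.
        $tail = substr($buffer, -$overlap);
    }
    fclose($handle);
    return $found;
}
```

Memory use stays at roughly the chunk size plus the overlap, regardless of how large the PDF is.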

score 1 · Answer 4 · answered Mar 06 '13 at 10:48

I was looking for the same thing today. And I came across a small PHP class over at http://de77.com/ that offers a quick and dirty solution. You can download the class directly. Output is UTF-8 encoded.

The creator says:

Here’s a PHP class I wrote which can be used to get title & author and a number of pages of any PDF file. It does not use any external application - just pure PHP.

// basic example
include 'PDFInfo.php';
$p = new PDFInfo;
$p->load('file.pdf');
echo $p->author;
echo $p->title;
echo $p->pages;

It works for me! All thanks go solely to the creator of the class ... well, maybe just a little thanks to me too for finding it ;)

answered Mar 06 '13 at 10:48

maxpower9000

That class is less efficient and doesn't read much of the PDF metadata. Using pdfinfo on Linux you can extract metadata that PDFInfo doesn't, so I think another library would do this better. – Néstor Apr 21 '14 at 17:11
Unfortunately, the link is now down, but it looked to be the easiest solution (especially since I only need the title...). @Néstor : why do you say it is "less efficient"? – brclz Jan 01 '17 at 23:17
@brclz This post is old and was already answered; as far as I remember, it didn't work for what I was doing. The link is dead because the author took it down. – Néstor Jan 05 '17 at 18:22
The class loses because it uses a hard-coded `dc:` xmlns prefix for DublinCore. XML specs state the prefix is arbitrary, e.g. an `Author` element under the `dc:` prefix is *exactly* equivalent to an `Author` element under any other prefix bound to the same namespace. – amphetamachine Dec 05 '21 at 21:07
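
To illustrate the prefix point, here is a minimal sketch that looks up Dublin Core elements by namespace URI instead of by prefix, using PHP's DOM extension. The function name xmpDublinCore is hypothetical, and it assumes the XMP packet has already been extracted from the PDF as an XML string. It only handles values stored as elements (optionally wrapped in rdf:Seq, rdf:Bag, or rdf:Alt lists), not values stored as attributes on rdf:Description.

```php
<?php
// Sketch: read Dublin Core fields from an XMP packet by namespace URI,
// so the result does not depend on which prefix the writer chose.
const DC_NS  = 'http://purl.org/dc/elements/1.1/';
const RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

function xmpDublinCore(string $xmp, string $localName): array
{
    if ($xmp === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Suppress libxml warnings on malformed input; loadXML returns false on failure.
    if (!@$doc->loadXML($xmp)) {
        return [];
    }
    $values = [];
    foreach ($doc->getElementsByTagNameNS(DC_NS, $localName) as $element) {
        // List-valued properties keep their entries in rdf:li children.
        $items = $element->getElementsByTagNameNS(RDF_NS, 'li');
        if ($items->length === 0) {
            $values[] = trim($element->textContent);
            continue;
        }
        foreach ($items as $item) {
            $values[] = trim($item->textContent);
        }
    }
    return $values;
}
```

With this approach a packet that binds the Dublin Core namespace to `dc:`, `foo:`, or any other prefix gives the same result for xmpDublinCore($xmp, 'creator').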

score 1 · Answer 5 · answered Jan 09 '17 at 10:29

You may use PDFtk to extract the page count:

// Windows
$bin = realpath('C:\\pdftk\\bin\\pdftk.exe');
$cmd = "cmd /c {$bin} {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";

// Unix
$cmd = "pdftk {$path} dump_data | grep NumberOfPages | sed 's/[^0-9]*//'";

If ImageMagick is available you may also use:

$cmd = "identify -format %n {$path}";

Execute in PHP via shell_exec():

$res = shell_exec($cmd);
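
Since the question is about title, author, and similar fields, the same dump_data output can be parsed for those as well. The sketch below assumes the output contains InfoKey:/InfoValue: line pairs plus a NumberOfPages: line. The function names parsePdftkDump and pdftkInfo are hypothetical, and the wrapper assumes pdftk is on the PATH. It also quotes the file name with escapeshellarg() so that spaces or shell metacharacters in the path do not break the command.

```php
<?php
// Sketch: turn the text printed by `pdftk <file> dump_data` into an array.
// Assumes InfoKey:/InfoValue: line pairs and a NumberOfPages: line.
function parsePdftkDump(string $dump): array
{
    $info = [];
    $key = null;
    foreach (preg_split('/\R/', $dump) as $line) {
        if (strpos($line, 'InfoKey: ') === 0) {
            // Remember the key until its value line arrives.
            $key = substr($line, 9);
        } elseif ($key !== null && strpos($line, 'InfoValue: ') === 0) {
            $info[$key] = substr($line, 11);
            $key = null;
        } elseif (preg_match('/^NumberOfPages: (\d+)/', $line, $m)) {
            $info['NumberOfPages'] = (int) $m[1];
        }
    }
    return $info;
}

// Hypothetical wrapper: runs pdftk and parses whatever it prints.
function pdftkInfo(string $path): array
{
    $output = shell_exec('pdftk ' . escapeshellarg($path) . ' dump_data');
    return is_string($output) ? parsePdftkDump($output) : [];
}
```

pdftkInfo('file.pdf') would then return entries such as Title and Author alongside NumberOfPages, or an empty array if the command produced no output.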

answered Jan 09 '17 at 10:29

maxpower9000

There's also a PHP-PDFtk library available: https://github.com/mikehaertl/php-pdftk – maxpower9000 Jan 09 '17 at 10:33
