43

Hi, I know about several PDF generators for PHP (fpdf, dompdf, etc.). What I want to know about is a parser.

For reasons beyond my control, certain information I need is only available in a table inside a PDF, and I need to extract that table and convert it to an array.

Any suggestions?

elviejo79
  • I am giving a bounty to anyone who can give us a working example of how to extract the text of a PDF. The solution has to use free libraries (no xPDF or PDF2Text) and be platform independent (it must work on Windows and Unix, so no PDF2Text). It can use the exec() or shell() function of PHP. – 2ndkauboy Aug 31 '10 at 11:50
  • Thanks Kau-Boy. Maybe a bounty will help motivate more detailed answers. – elviejo79 Sep 01 '10 at 04:39
  • For reference, there is a better PDF parser here: https://github.com/smalot/pdfparser – Adrian P. Apr 27 '22 at 13:32

7 Answers

31

I've written one before (for similar needs), and I can say this: have fun. It's quite a complex task. The PDF specification is large and unwieldy, and there are several ways of storing text inside a PDF. The kicker is that each PDF generator works differently. So while something like TFPDF or DOMPDF creates PDFs that are REALLY easy to read (from a machine standpoint), Acrobat makes some really hellish documents.

The reason is how it writes the text. Most DOM-based renderers that I've used write the entire line as one string and position it once (which is really easy to read). Acrobat tries to be more efficient (and it is) by writing only one or maybe a few characters at a time and positioning them independently. While this REALLY simplifies rendering, it makes reading MUCH more difficult.
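To make the difference concrete, here is a minimal sketch of pulling text out of an already-decompressed content stream. It assumes the text uses only literal strings with the `Tj` operator (one string, positioned once) or the `TJ` operator (an array of fragments with kerning offsets in between). The function names are made up for illustration; hex strings, nested parentheses, octal escapes and font re-mapping are not handled.

```php
<?php
// Hypothetical helper: strips the outer parentheses of a PDF literal
// string and resolves simple backslash escapes such as \( \) and \\.
// Octal escapes and \n, \r, \t are NOT decoded in this sketch.
function unescapePdfLiteral(string $literal): string
{
    $inner = substr($literal, 1, -1);
    return preg_replace('/\\\\(.)/s', '$1', $inner);
}

// Hypothetical helper: returns one entry per Tj or TJ operator found.
function extractTextOperators(string $content): array
{
    // A literal string such as (Hello), allowing escaped characters.
    $str = '\((?:\\\\.|[^\\\\()])*\)';
    // Either "(string) Tj" or "[ (frag) -20 (frag) ... ] TJ".
    $pattern = '/(' . $str . ')\s*Tj|\[(?:' . $str . '|[^\]()])*\]\s*TJ/s';
    preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

    $chunks = [];
    foreach ($matches as $m) {
        if (substr($m[0], -2) === 'TJ') {
            // Glue the fragments together and ignore the kerning numbers.
            preg_match_all('/' . $str . '/s', $m[0], $parts);
            $chunks[] = implode('', array_map('unescapePdfLiteral', $parts[0]));
        } else {
            $chunks[] = unescapePdfLiteral($m[1]);
        }
    }
    return $chunks;
}

$sample = "BT /F1 12 Tf 72 720 Td (Hello World) Tj ET\n"
        . "BT /F1 12 Tf 72 700 Td [(H) -20 (el) 15 (lo)] TJ ET\n";
print_r(extractTextOperators($sample));
```

The first line of the sample is the "easy" style, the second is the fragment-by-fragment style. Real Acrobat output can also position every fragment with its own operator, which this sketch would return as separate entries.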

The upside here is that the PDF format itself is really simple. You have "objects" that follow a regular syntax, and you link them together to generate the content. The specification does a good job of describing the file format. But real-world reading is going to take a bit of brain power...
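As a rough illustration of that regular syntax, the sketch below lists the indirect objects (`N G obj ... endobj`) found in the raw bytes of a file. The function name is hypothetical, and this is only an inventory, not a conforming reader: it ignores the xref table and object streams, and it can be fooled by binary stream data that happens to contain the word `endobj`.

```php
<?php
// Hypothetical helper: maps object number => generation and raw body.
function listPdfObjects(string $pdf): array
{
    $objects = [];
    preg_match_all('/(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj/s', $pdf, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $objects[(int) $m[1]] = [
            'generation' => (int) $m[2],
            'body'       => trim($m[3]),
        ];
    }
    return $objects;
}

// Tiny hand-written fragment, not a complete PDF.
$sample = "%PDF-1.4\n"
        . "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
        . "2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n";
print_r(listPdfObjects($sample));
```

References such as `2 0 R` inside a body are how objects link to each other, so a real parser would resolve them against this map.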

Some helpful pieces of advice that I had to learn the hard way if you're going to write it yourself:

  1. Adobe likes to re-map fonts. So character 65 will likely not be A... You need to find a map object and deduce what it's doing based upon what characters are in there. The mapping is also space-efficient: if a character doesn't appear in the document for that font, it isn't included (which makes life difficult if you try to programmatically edit a PDF)...
  2. Write it as abstractly as possible. Write classes for each object type and each native type (strings, numbers, etc.). Let those classes parse for you. There will be a fair bit of repetition in there, but you'll save yourself in the end when you realize that you need to tweak something for only one specific type...
  3. Write for a specific version or two of the PDF spec, and enforce it. Check the version number, and if it's higher than you expect, bail... And don't try to "make it work". If you want to support newer versions, break out the specification and upgrade the parser from there. Don't try to trial-and-error your way up (it's not fun)...
  4. Good luck with compressed streams. I've found that typically you can't trust the length arguments to verify what you are uncompressing. Sometimes (for some generators) it works well... For others it's off by one or more bytes. I just attempt to inflate it if the filter matches, and then force the length...
  5. When testing lengths, don't use strlen. Use mb_strlen($string, '8bit'), since it counts raw bytes regardless of the character set (and tolerates characters that are invalid in other charsets).
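Points 3 and 4 can be sketched roughly as follows. This is not the parser described above, just a minimal illustration with made-up function names. It assumes every stream uses only `/FlateDecode` (zlib data, which `gzuncompress()` from PHP's zlib extension can decode), it deliberately ignores `/Length`, and the default cut-off version `1.4` is an arbitrary choice.

```php
<?php
// Hypothetical helper for point 3: read "%PDF-x.y" and bail if it is
// missing or newer than the version this parser was written for.
function assertSupportedPdfVersion(string $pdf, string $maxVersion = '1.4'): string
{
    if (!preg_match('/^%PDF-(\d+\.\d+)/', $pdf, $m)) {
        throw new RuntimeException('No PDF header found');
    }
    if (version_compare($m[1], $maxVersion, '>')) {
        throw new RuntimeException("Unsupported PDF version {$m[1]}");
    }
    return $m[1];
}

// Hypothetical helper for point 4: the bytes before "endstream" may or
// may not end with an end-of-line marker, so try the candidates from
// shortest to longest and keep the first one that decompresses.
function inflateStream(string $raw): ?string
{
    foreach (["\r\n", "\n", "\r", ''] as $eol) {
        if ($eol !== '' && substr($raw, -strlen($eol)) !== $eol) {
            continue;
        }
        $candidate = substr($raw, 0, strlen($raw) - strlen($eol));
        $data = @gzuncompress($candidate); // false on truncated or bad data
        if ($data !== false) {
            return $data;
        }
    }
    return null;
}

// Finds streams by keyword instead of trusting /Length.
function inflateStreams(string $pdf): array
{
    $decoded = [];
    preg_match_all('/\bstream\r?\n(.*?)endstream/s', $pdf, $matches);
    foreach ($matches[1] as $raw) {
        $data = inflateStream($raw);
        if ($data !== null) {
            $decoded[] = $data;
        }
    }
    return $decoded;
}
```

Trying the shortest candidate first means a stream whose compressed data genuinely ends in a newline byte is still decoded correctly, because the over-trimmed attempt simply fails and the next one is used.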

Otherwise, best of luck...

ircmaxell
  • +1 I might even call it nightmarish. The spec is huge, a PDF file almost resembles a filesystem with so many different options and choices within... you can certainly see how they can hide jail-breaking ability in there. – Rudu Aug 31 '10 at 22:07
  • Would you expect anything less from Adobe? – bpeterson76 Sep 02 '10 at 15:49
  • @bpeterson76, yes.. I don't want my PDFs to be downloadable :( – Ravi Dhoriya ツ Feb 13 '14 at 10:32
17

I use PDFBox for that (http://pdfbox.apache.org/). This software is Java-based and platform independent. It is fast and reliable. You can use it via exec or shell execute, or via a PHP/Java bridge (http://php-java-bridge.sourceforge.net/).
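A minimal sketch of the exec route might look like this. The function names and the jar path are placeholders, and the `ExtractText` sub-command is an assumption based on the PDFBox 2.x command-line app; other versions may use a different syntax, so check the documentation for the jar you download.

```php
<?php
// Hypothetical helper: builds the command line. $jarPath is wherever
// you saved the pdfbox-app jar (placeholder, not a real location).
function buildPdfBoxCommand(string $jarPath, string $pdfPath, string $txtPath): string
{
    return sprintf(
        'java -jar %s ExtractText %s %s',
        escapeshellarg($jarPath),
        escapeshellarg($pdfPath),
        escapeshellarg($txtPath)
    );
}

// Hypothetical helper: runs PDFBox and returns the extracted text.
function extractTextWithPdfBox(string $jarPath, string $pdfPath): string
{
    $txtPath = tempnam(sys_get_temp_dir(), 'pdf');
    exec(buildPdfBoxCommand($jarPath, $pdfPath, $txtPath), $output, $exitCode);
    $text = ($exitCode === 0) ? file_get_contents($txtPath) : false;
    unlink($txtPath);
    if ($text === false) {
        throw new RuntimeException("PDFBox failed with exit code $exitCode");
    }
    return $text;
}
```

This needs a Java runtime on the server and a PHP configuration that allows `exec()`.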

Timo Haberkern
3

Have you already looked at xPDF? It includes a program called pdftotext that will do the conversion. You can call it from PHP and then read in the text version of the PDF. You will need to be able to run exec() or system() from PHP, so this may not work on all hosted solutions.

Also, there are some examples on the PHP site that will convert PDF to text, although they're pretty rough. You may want to try some of those examples as well. On that PHP page, search for luc at phpt dot org.
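For the table-to-array part of the question, one possible sketch is to run pdftotext with its `-layout` option (which tries to keep the physical column layout) and then split each line on runs of two or more spaces. The function names are made up, and the splitting rule is only a heuristic: it assumes columns are separated by at least two spaces and that no cell is empty.

```php
<?php
// Hypothetical helper: runs pdftotext and returns its output.
// The trailing "-" tells pdftotext to write the text to stdout.
function pdfToLayoutText(string $pdfPath): string
{
    $command = 'pdftotext -layout ' . escapeshellarg($pdfPath) . ' -';
    exec($command, $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("pdftotext failed with exit code $exitCode");
    }
    return implode("\n", $lines);
}

// Hypothetical helper: turns layout-preserved text into an array of
// rows, treating two or more whitespace characters as a column gap.
function layoutTextToRows(string $text): array
{
    $rows = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines and page breaks
        }
        $rows[] = preg_split('/\s{2,}/', $line);
    }
    return $rows;
}

// Usage, assuming a file named report.pdf exists:
// $rows = layoutTextToRows(pdfToLayoutText('report.pdf'));
```

Cells that contain a single space, such as "Widget A", stay in one piece; anything else around the table (titles, footers) will show up as extra rows that you have to filter out yourself.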

ryanday
  • I tried out xpdf based on your recommendation, and was surprised how well it works - thanks! – Tomba Feb 04 '11 at 17:07
  • As of 2022-Jul-2, the links to the PHP site are 404 (http://us3.php.net/manual/en/ref.pdf.php ) and I can't find the equivalent page. – Rick Hellewell Jul 02 '22 at 22:43
2

Zend_Pdf is part of the Zend Framework. Their manual states:

The Zend_Pdf component is a PDF (Portable Document Format) manipulation engine. It can load, create, modify and save documents. Thus it can help any PHP application dynamically create PDF documents by modifying existing documents or generating new ones from scratch.

Bill Karwin
1

Have a look at GhostScript or ITextSharp; there are various cross-platform versions of both.

Mark Redman
1

This is a PHP PDF parser, which exists in two flavours:

  1. Free version can parse PDFs up to format PDF 1.5
  2. Commercial add-on can parse any PDF format (up to current 1.9)
Pranav 웃
lubosdz
0

It may not actually be a table inside the PDF as the PDF loses that sort of information...

mark stephens