6

I have PDFs that are mostly simply formatted text. I would like to parse the text with PHP. I realize that the PDF is binary so I need a utility or library to convert it to text.

Any recommendations?

d-cubed
T. Brian Jones

3 Answers

4

I ended up using Xpdf (which includes pdftotext). It works well, and I use it in production to extract text from millions of PDFs uploaded to our servers.

Below is the install process for Linux CentOS:

  1. download version 3.03 from http://foolabs.com/xpdf/download.html
  2. extract the archive: tar -zxvf xpdfbin-linux-3.03.tar.gz
  3. create the directories required for the install (some or all of these might already exist)
    • sudo mkdir /usr/local/man/
    • sudo mkdir /usr/local/man/man1/
    • sudo mkdir /usr/local/man/man5/
    • sudo mkdir /usr/local/etc/xpdfrc/
  4. move files out of the extracted folder (cd into the folder where xpdf was just extracted)
    • move all the executables from the bin64 directory (xpdf, pdftotext, and the rest) to /usr/local/bin/
    • move the sample-xpdfrc file to /usr/local/etc/xpdfrc (it can be used as is)
    • move the manual pages from the doc directory (*.1 to /usr/local/man/man1/ and *.5 to /usr/local/man/man5/)
  5. xpdf should now be installed and ready to use
  6. you can delete the downloaded tar.gz file and the folder it was extracted into
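
Steps 3 and 4 above could be scripted roughly as follows. This is only a sketch: the function name install_xpdf and its two arguments are made up for illustration, and the extracted folder is assumed to contain bin64/, doc/ and sample-xpdfrc as the steps describe. It copies rather than moves the files, so the extracted folder can still be deleted afterwards as in step 6.

```shell
#!/bin/sh
# install_xpdf SRC [PREFIX]  (hypothetical helper, not part of Xpdf)
#   SRC    = folder the tarball was extracted into (assumed layout: bin64/, doc/, sample-xpdfrc)
#   PREFIX = install root, /usr/local as in the steps above
install_xpdf() {
    src=$1
    prefix=${2:-/usr/local}

    # mkdir -p creates missing parents and does nothing for directories that already exist
    mkdir -p "$prefix/bin" "$prefix/man/man1" "$prefix/man/man5" "$prefix/etc/xpdfrc" || return 1

    # executables (xpdf, pdftotext, ...)
    cp "$src"/bin64/* "$prefix/bin/" || return 1
    # sample configuration, usable as is
    cp "$src/sample-xpdfrc" "$prefix/etc/xpdfrc/" || return 1
    # manual pages
    cp "$src"/doc/*.1 "$prefix/man/man1/" || return 1
    cp "$src"/doc/*.5 "$prefix/man/man5/" || return 1
}
```

Run as root, this might be called as install_xpdf ./xpdfbin-linux-3.03 /usr/local, where the folder name is an assumption based on the tarball name.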
T. Brian Jones
4

Third party software can dump the text contents of a PDF file, for example:

  • xdoc2txt (Windows-only, used in WinMerge plugins)
  • pdftotext, part of Xpdf
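
As a rough illustration of the pdftotext route, the wrapper below writes the extracted text to standard output. The function name pdf_to_text and the PDFTOTEXT override variable are hypothetical; -layout, -enc UTF-8 and "-" as the output file are standard pdftotext options, but whether -layout suits your documents is something to check against your own PDFs.

```shell
#!/bin/sh
# pdf_to_text FILE  (hypothetical wrapper around pdftotext)
# Prints the text of FILE on stdout; returns non-zero if FILE is unreadable.
# PDFTOTEXT may point at a specific binary; it defaults to "pdftotext" on PATH.
pdf_to_text() {
    pdf=$1
    if [ ! -r "$pdf" ]; then
        echo "cannot read: $pdf" >&2
        return 1
    fi
    # -layout     keep the physical layout of the page as far as possible
    # -enc UTF-8  emit UTF-8 text
    # -           write to stdout instead of a .txt file
    "${PDFTOTEXT:-pdftotext}" -layout -enc UTF-8 "$pdf" -
}
```

From PHP, the same command line could be run with shell_exec(), quoting the file path with escapeshellarg(), and the returned string parsed as ordinary text.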
Benoit
1

You can't do that with file_get_contents() alone: it returns the raw bytes of the file, and a PDF stores its text in encoded, usually compressed streams rather than as plain text. To read or modify a PDF file you can use a third-party library. Take a look at:

And don't forget