How do I convert a PDF to text so I can parse that text with PHP?

Question

I have PDFs that are mostly simply formatted text. I would like to parse the text with PHP. I realize that the PDF is binary so I need a utility or library to convert it to text.

Any recommendations?

What do you mean? To get the binary data of the PDF file, `file_get_contents()` will do fine. — Pekka, Jun 23 '11 at 09:01
If you want to edit PDF files, have a look at this question: http://stackoverflow.com/questions/7364/pdf-editing-in-php — Mike, Jun 23 '11 at 09:07
This question may help you - http://stackoverflow.com/questions/1004478/read-pdf-files-with-php — Sourabh, Jun 23 '11 at 09:08
I clarified my goal. I have PDFs that are mostly simply formatted text, and I want to parse the text with PHP. — T. Brian Jones, Jun 23 '11 at 09:16

score 4 · Accepted Answer · answered Nov 06 '12 at 05:38

I ended up using Xpdf (which includes pdftotext). It works well, and I use it in production to extract text from millions of PDFs uploaded to our servers.

Below is the install process for CentOS Linux:

1. Download version 3.03 from here: http://foolabs.com/xpdf/download.html
2. Extract the tar.gz: tar -zxvf xpdfbin-linux-3.03.tar.gz
3. Create the required directories for the install (some or all of these might exist already):
   - sudo mkdir /usr/local/man/
   - sudo mkdir /usr/local/man/man1/
   - sudo mkdir /usr/local/man/man5/
   - sudo mkdir /usr/local/etc/xpdfrc/
4. Move the files from the extracted folders (cd into the folder where Xpdf was just unpacked):
   - move all the executables from the bin64 directory (xpdf, pdftotext, and all the other files) to /usr/local/bin/
   - move the sample-xpdfrc file to /usr/local/etc/xpdfrc (this can be used as is)
   - move the manual pages from the doc directory (*.1 to /usr/local/man/man1/ and *.5 to /usr/local/man/man5/)
5. Xpdf should now be installed and ready to use.
6. You can delete the downloaded tar.gz file and the folder where it was unpacked.
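
The manual steps above could be scripted roughly as follows. This is only a sketch: the function name `install_xpdf` and its two arguments are made up for illustration, it copies rather than moves (so the unpacked folder stays intact until you delete it), and it assumes the archive unpacks to a folder containing `bin64/`, `doc/` and `sample-xpdfrc` as described in the steps.

```shell
# Hypothetical helper that mirrors the manual install steps.
#   $1  folder where the Xpdf archive was unpacked
#   $2  install root (defaults to /usr/local, as in the steps above)
install_xpdf() {
    local src="$1"
    local prefix="${2:-/usr/local}"

    # mkdir -p does not fail when a directory already exists.
    # As in the steps above, xpdfrc is created as a directory here.
    mkdir -p "$prefix/bin" \
             "$prefix/man/man1" \
             "$prefix/man/man5" \
             "$prefix/etc/xpdfrc" || return 1

    # Executables (xpdf, pdftotext, ...) from the 64-bit build.
    cp "$src"/bin64/* "$prefix/bin/" || return 1

    # Sample configuration file, usable as is.
    cp "$src/sample-xpdfrc" "$prefix/etc/xpdfrc/" || return 1

    # Manual pages, sorted by section number.
    cp "$src"/doc/*.1 "$prefix/man/man1/" || return 1
    cp "$src"/doc/*.5 "$prefix/man/man5/" || return 1
}
```

Writing to /usr/local normally needs root, so a call such as `install_xpdf ./xpdfbin-linux-3.03` would go in a script that is itself run with sudo. Passing a different second argument lets you try the layout in a scratch directory first.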

score 4 · Answer 2 · answered Jun 23 '11 at 09:32

Third-party software can dump the text contents of a PDF file, for example:

- xdoc2txt (Windows-only, used in WinMerge plugins)
- pdftotext, part of Xpdf
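
For the pdftotext route, here is a minimal sketch of the kind of command a PHP script might shell out to (for example through `shell_exec()`) and then parse the output of. The wrapper name `pdf_to_text` is hypothetical; `-layout`, `-enc` and the `-` output name are pdftotext options, but whether `-layout` suits your documents is an assumption worth testing on real files.

```shell
# Hypothetical wrapper: print the text of one PDF on stdout.
#   $1  path to the PDF file
pdf_to_text() {
    local pdf="$1"

    # Fail early with a clear message if the tool is not on PATH.
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found; install Xpdf first" >&2
        return 127
    fi

    # -layout     keep the original physical layout as far as possible
    # -enc UTF-8  choose the output text encoding
    # -           write the text to stdout instead of a .txt file
    pdftotext -layout -enc UTF-8 "$pdf" -
}
```

Calling `pdf_to_text report.pdf > report.txt` would leave a plain-text file that PHP can read with ordinary string functions. Quoting `"$pdf"` matters if uploaded file names can contain spaces.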

Benoit


score 1 · Answer 3 · answered Jun 23 '11 at 09:15

You can't do that with file_get_contents() alone: it returns the raw bytes of the file, and a PDF stores its text in encoded, usually compressed, binary streams rather than as plain text. To read or modify a PDF file you can use third-party libraries. Take a look at:

And don't forget

http://php.net/manual/en/book.pdf.php
