How to open PDF and read it?

Question

How can I open a PDF file and read some of its contents with Python (preferred, though Ruby, Perl, or PHP are fine too), provided the text is recognized rather than stored as an image, or else report that it's impossible without OCR? TIA

Update: thanks for the solutions, I'm sure some of them will suit me fine.

@RichH, I have a PDF file and don't know whether it is image- or text-based. I'm looking for a tool to help me find that out and, if it's text-based, extract some of its contents.

Are they image PDF files or text PDF files (you can find out by trying to copy the text out manually)? What do you want to read? Text? Images? Layout? You may want to reword your question too - I didn't understand the second half. — RichH, Nov 08 '09 at 20:07
This link can help you: http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text. And it's `its` contents ;-) — RedGlyph, Nov 08 '09 at 20:13

score 5 · Answer 1 · answered Nov 08 '09 at 20:49

For Perl, check out these modules:

Ether


score 1 · Answer 2 · answered Nov 08 '09 at 20:18

Parsing a PDF and making something useful out of it is hard because the format is focused on preserving layout. Text can be stored so that each letter is positioned individually, and depending on the font, the text might also be stored as graphics.

Libraries for reading PDFs that I know of include the Zend Framework, whose PDF component includes a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which gives quite usable results and offers bindings for different languages.
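For the Python case asked about in the question, here is a minimal sketch of one way to guess whether a PDF is text-based or needs OCR. It is not from the answers above: it assumes the third-party pypdf package is installed, and the helper names `extract_page_texts` and `classify_pdf`, as well as the 20-character threshold, are made up for illustration. Because text can be stored letter by letter or as graphics, the extracted text may be empty or garbled even for a PDF that looks fine on screen, so treat the result as a hint only.

```python
def extract_page_texts(path):
    """Return the extracted text of each page as a list of strings.

    Assumption: the third-party pypdf package is installed (pip install pypdf).
    """
    # Imported lazily so classify_pdf below can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for scanned (image-only) pages.
    return [page.extract_text() or "" for page in reader.pages]


def classify_pdf(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is text-based from its per-page extracted text.

    min_chars_per_page is an arbitrary illustrative threshold, not a standard.
    Returns "text", "mixed", or "image-only (OCR needed)".
    """
    if not page_texts:
        return "image-only (OCR needed)"
    # Count pages whose text, ignoring whitespace, reaches the threshold.
    pages_with_text = sum(
        1 for text in page_texts
        if len("".join(text.split())) >= min_chars_per_page
    )
    if pages_with_text == len(page_texts):
        return "text"
    if pages_with_text == 0:
        return "image-only (OCR needed)"
    return "mixed"


# Example usage (the file name is hypothetical):
# texts = extract_page_texts("document.pdf")
# print(classify_pdf(texts))
# print(texts[0][:200])  # first 200 characters of page 1, if any
```

Keeping the classification separate from the extraction means the threshold logic can be tried on plain strings, and a different extractor can be swapped in if pypdf does not handle a particular file well.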
