2

How can I open a PDF file and read some of its contents with Python? (Python is preferred, but Ruby, Perl, or PHP are fine too.) If the file contains recognized text (not just an image), I want to extract it; otherwise I want to report that it's impossible without OCR. TIA

Update: thanks for the solutions, I'm sure some of them will suit me fine.

@RichH, I have a PDF file and don't know whether it is image- or text-based. I'm looking for a tool to help me find that out and, if it's text-based, extract some of its contents.

brian d foy
Fluffy
  • Are they image PDF files or text PDF files (you can find out by trying to copy the text out manually)? What do you want to read? Text? Images? Layout? You may want to reword your question too - I didn't understand the second half. – RichH Nov 08 '09 at 20:07
  • This link can help you: http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text. And it's `its` contents ;-) – RedGlyph Nov 08 '09 at 20:13
  • You might find this thread useful. – jkndrkn Nov 08 '09 at 20:04

2 Answers

5

For Perl, check out these modules:

Ether
1

Parsing a PDF and making something useful out of it is hard, because the format is focused on preserving layout. Text can be stored so that each letter is positioned individually, and depending on the font, the text might also be stored as graphics.

Libraries I know of that read PDFs include the Zend Framework, which has a PDF component with a parser that can be used from PHP and gives more or less usable results, and the commercial PDFlib, which gives quite usable results and offers bindings for different languages.
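Since the question prefers Python, here is a minimal sketch of the "text-based or image-only?" check. It is not something from this thread: it assumes the third-party `pypdf` package is installed, and the function names and the `min_chars` threshold are made up for illustration. The idea is simply to try extracting text from every page and treat pages that yield almost nothing as likely scans.

```python
def classify_pdf_text(page_texts, min_chars=20):
    """Classify a PDF from the text extracted from each of its pages.

    Returns "text" if every page yields at least `min_chars` characters,
    "no-text" if no page does (OCR is probably needed), and "mixed" otherwise.
    The threshold is an arbitrary assumption, not a standard value.
    """
    pages_with_text = sum(
        1 for text in page_texts if len((text or "").strip()) >= min_chars
    )
    if pages_with_text == 0:
        return "no-text"
    if pages_with_text == len(page_texts):
        return "text"
    return "mixed"


def extract_page_texts(path):
    """Return a list with the extracted text of each page.

    Assumption: pypdf is available (pip install pypdf). The import is kept
    local so the classifier above can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def report(path):
    """Print a short verdict for the PDF at `path` and return the page texts."""
    texts = extract_page_texts(path)
    verdict = classify_pdf_text(texts)
    if verdict == "no-text":
        print("No extractable text found; this probably needs OCR.")
    else:
        print(f"Extractable text found ({verdict}).")
    return texts
```

Calling `report("some.pdf")` (a hypothetical file name) prints the verdict and returns the per-page text. This is only a heuristic: a scanned PDF with a hidden OCR layer will be reported as text, and a PDF whose text was converted to outlines will be reported as having none.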

johannes