I have Python 3.2 installed (a 64-bit version, in case it matters) on a Windows machine. I need to open a bunch of PDF files, find certain numbers in the text, and store them. This should take, and I'd like it to take, no more than one day of work.

  • Can I parse the PDFs with plain Python without too much hassle?
  • Is there a library that would achieve this easily?

If it's too complicated to do with this Python installation I can install a different version, but that requires a lot of extra work, so other solutions are appreciated.

schme
  • What do you mean by plain Python? No third-party modules? You might find the PyPI package [pyPdf](http://pypi.python.org/pypi/pyPdf/) to be a useful resource. – Jul 11 '12 at 06:50
  • I assumed the 3.0 compatibility meant it wouldn't support 3.2. But if it does, it is pretty much what I'm looking for. – schme Jul 11 '12 at 07:18
  • Never tried this but maybe PDFMiner is what you need: http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text/1257121#1257121 – Ekaterina Mishina Jul 12 '12 at 09:06
  • PDFMiner was close, and pdf2txt.py was pretty good but still insufficient. The files I have don't appear to be well formed (I don't understand PDFs, so take this with a grain of salt), because all the text extracts are useless for getting the data (it appears out of order or is missing). I have yet to browse through the more advanced features of PDFMiner. – schme Jul 13 '12 at 07:23
  • @schme Sometimes the text cannot be extracted from PDF documents because it is not there (think of scanned papers). But you probably know that. Similarly, the font can be embedded in the PDF document. This way it is possible to *encipher* the document so that you can see the text but cannot extract it. – pepr Jul 13 '12 at 12:57
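
For the "find certain numbers and store them" part of the question, here is a minimal sketch using only the standard library. It assumes the text has already been extracted into a string (for example with PDFMiner's pdf2txt.py, as discussed in the comments above). The names `NUMBER_RE`, `find_numbers`, and `find_labelled` are hypothetical, and the number pattern is a guess that would likely need adjusting to the actual files.

```python
import re

# Assumed number format: optional minus sign, optional thousands separators,
# optional decimal part (e.g. 42, -3.5, 1,234.50). Adjust to the real data.
NUMBER_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def find_numbers(text):
    """Return every number-like token in text as a float, in order of appearance."""
    return [float(token.replace(",", "")) for token in NUMBER_RE.findall(text)]


def find_labelled(text, label):
    """Return the first number that follows `label` on the same line, or None."""
    # Skip anything between the label and the number that is not a digit,
    # a minus sign, or a line break.
    pattern = re.compile(
        re.escape(label) + r"[^\d\n-]*(" + NUMBER_RE.pattern + ")"
    )
    match = pattern.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))
```

Searching by label may help when the extracted text comes out of order, as long as each label and its value still end up on the same line. If they do not, this approach will not be enough on its own.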

0 Answers