
Since I want to move from Python 2 to 3, I tried to work with pdfminer3k in Python 3.4. It seems like they have changed everything. Their changelogs do not reflect the changes, and I have had no success parsing a PDF with pdfminer3k. For example:

They have moved PDFDocument into pdfparser (sorry if I spell it incorrectly). PDFPage used to have a create_pages method, which is gone now. All I can see inside PDFPage are internal methods. Does anybody have a working example of pdfminer3k? There seems to be no new documentation reflecting any of the changes.

Jack_of_All_Trades

2 Answers


If you are interested in reading text from a PDF file, the following code works with pdfminer3k using Python 3.4.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine

# Open the PDF in binary mode; the with-block closes it even if parsing fails.
with open('file.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument()
    # pdfminer3k needs the parser and the document linked in both directions.
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')  # empty password
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        # Print only the layout objects that carry text.
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                print(lt_obj.get_text())
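
If you would rather collect the text than print it, the loop above can be wrapped in a small helper. This is only a sketch under the same pdfminer3k assumptions as the code above; `layout_text` and `page_texts` are names I made up for illustration, not part of the library.

```python
def layout_text(layout, text_types):
    """Concatenate get_text() of every layout object that is one of text_types."""
    return "".join(
        obj.get_text() for obj in layout if isinstance(obj, text_types)
    )


def page_texts(path):
    """Return a list holding the extracted text of each page, in page order."""
    # Imported lazily so layout_text() stays usable without pdfminer3k installed.
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    texts = []
    with open(path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize('')  # empty password, as in the code above
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in doc.get_pages():
            interpreter.process_page(page)
            # One string per page, built from the text boxes and lines only.
            texts.append(
                layout_text(device.get_result(), (LTTextBox, LTTextLine))
            )
    return texts
```

With that, something like `[i for i, t in enumerate(page_texts('file.pdf')) if 'x' in t]` should give the indexes of the pages that mention 'x'.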
user1767754
CPB
  • I'm assuming this doesn't work with scanned images, as I probably don't have any textboxes or textlines? – Jglstewart Apr 22 '15 at 13:53
  • @Jglstewart for that kind of PDF document you have to convert each page to an image and run OCR over each image to get the text. An example of an OCR engine is Tesseract; there is Python code for that – Nwawel A Iroume Dec 08 '15 at 09:17
  • I can confirm that this does solve literally ALL your unicode woes. Haha :) – lol Jun 25 '16 at 07:35
  • I compared this answer to the accepted answer in [this SO post](https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python) and this one doesn't extract nearly as much text – Jeremy Oct 09 '17 at 22:16
  • I've wanted to try this package to [save some time](https://stackoverflow.com/questions/58547310/anyway-to-multithread-pdf-mining). I know I can do `l=[] if 'x' in lt_obj.get_text(): l.append(page)`, but how can I save `l` as a PDF? pdfminer3k doesn't have that create_pages method – Moo10000 Feb 06 '20 at 19:22

Perhaps you could use pdfminer.six. Its description:

fork of PDFMiner using six for Python 2+3 compatibility

After installing it using pip:

pip install pdfminer.six

Its usage is just like pdfminer's, at least in my code.
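
For illustration, here is roughly what that looks like. Treat it as a sketch, assuming a reasonably recent pdfminer.six where `PDFDocument` lives in `pdfminer.pdfdocument` and `PDFPage.create_pages` is available (unlike in pdfminer3k); `extract_page_texts` is a hypothetical helper name, not something the package provides.

```python
def extract_page_texts(path):
    """Return the text of each page of the PDF at `path` (hypothetical helper)."""
    # Imported inside the function so a missing pdfminer.six surfaces as an
    # ImportError at call time rather than when this module is loaded.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    texts = []
    with open(path, 'rb') as fp:
        # Here the document takes the parser directly; there is no
        # set_document/set_parser/initialize step.
        doc = PDFDocument(PDFParser(fp))
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            # Keep only the layout objects that carry text.
            texts.append("".join(
                obj.get_text() for obj in layout
                if isinstance(obj, (LTTextBox, LTTextLine))
            ))
    return texts
```

Calling `extract_page_texts('file.pdf')` should then give one string per page.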

Hope this saves your day :)

Lordran