pdfminer3k has no method named create_pages in PDFPage

Question

Since I want to move from Python 2 to 3, I tried to work with pdfminer3k in Python 3.4. It seems like they have changed everything. Their change logs do not reflect the changes they have made, and I have had no success in parsing PDFs with pdfminer3k. For example:

They have moved PDFDocument into pdfparser (sorry if I spelled it incorrectly). PDFPage used to have a create_pages method, which is gone now. All I can see inside PDFPage are internal methods. Does anybody have a working example of pdfminer3k? It seems like there is no new documentation to reflect any of the changes.

What exactly are you looking for? How to `create_pages` in pdfminer3k? – avi, Oct 17 '14 at 07:44
I am looking for any example that lets me do with pdfminer3k what I did with pdfminer, based on their new API, which is not documented anywhere. – Jack_of_All_Trades, Oct 17 '14 at 12:14

score 24 · Answer 1 · edited Jan 17 '18 at 08:15

If you are interested in reading text from a PDF file, the following code works with pdfminer3k using Python 3.4.

from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine

# Open the PDF in binary mode; the with block closes the file when done.
with open('file.pdf', 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument()
    # In pdfminer3k the parser and the document are linked to each other explicitly.
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')  # empty password
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                print(lt_obj.get_text())
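
The loop above only inspects the top-level objects of each page layout, so text nested inside other containers (figures, for example) may be skipped. A minimal sketch of a recursive walk follows. It assumes, rather than relies on anything stated in the answer, that text-bearing objects expose `get_text()` and that containers are iterable; `iter_text` is a hypothetical helper name, not part of pdfminer3k.

```python
def iter_text(layout_obj):
    """Yield text from a layout object and from anything nested inside it."""
    if hasattr(layout_obj, 'get_text'):
        # Text-bearing object: its text already covers its children.
        yield layout_obj.get_text()
    else:
        try:
            children = iter(layout_obj)
        except TypeError:
            return  # not a container (for example a line or rectangle)
        for child in children:
            yield from iter_text(child)
```

Inside the page loop, `for text in iter_text(layout): print(text)` would take the place of the `isinstance` check. Text found inside figures may arrive one character at a time, so joining the pieces might be needed.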

edited Jan 17 '18 at 08:15 · user1767754
answered Jan 02 '15 at 08:29 · CPB

I'm assuming this doesn't work with scanned images, as they probably don't have any text boxes or text lines? – Jglstewart Apr 22 '15 at 13:53
@Jglstewart For that kind of PDF document you have to convert each page to an image and run OCR over each image to get the text. One example of an OCR engine is Tesseract, and there is Python code for it. – Nwawel A Iroume Dec 08 '15 at 09:17
I can confirm that this does solve literally ALL your unicode woes. Haha :) – lol Jun 25 '16 at 07:35
I compared this answer to the accepted answer in [this SO post](https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python) and this one doesn't extract nearly as much text – Jeremy Oct 09 '17 at 22:16
I've wanted to try this package to [save some time](https://stackoverflow.com/questions/58547310/anyway-to-multithread-pdf-mining). I know I can do `l=[] if 'x' in lt_obj.get_text(): l.append(page)`, but how can I save `l` as a PDF? pdfminer3k doesn't have that create pages method. – Moo10000 Feb 06 '20 at 19:22

score 3 · Answer 2 · answered Mar 31 '17 at 01:36

Perhaps you could use pdfminer.six. Its description:

fork of PDFMiner using six for Python 2+3 compatibility

After installing it using pip:

pip install pdfminer.six

Its usage is just like pdfminer's, at least in my code.
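
As a rough sketch of what that looks like, here is the text-extraction loop written against pdfminer.six, where `PDFPage.create_pages` is available and `PDFDocument` takes the parser directly. The function name `extract_text_boxes` and the path argument are illustrative assumptions, not something from the answer.

```python
def extract_text_boxes(path):
    """Return the text of each top-level text box or line in the PDF at `path`."""
    # Imported here so the sketch loads even where pdfminer.six is not installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    texts = []
    with open(path, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)  # the document is built from the parser
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # create_pages yields one PDFPage per page of the document.
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            for lt_obj in device.get_result():
                if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                    texts.append(lt_obj.get_text())
    return texts
```

Calling `extract_text_boxes('file.pdf')` would then return a list of strings, one per text box or line found at the top level of each page.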

Hope this could save your day :)
