How to read a PDF file using pdfminer3k?

Question

I am using Python 3.5 and want to read the text from PDF files line by line. I tried pdfminer3k but could not find a clear usage example anywhere. How do I use it correctly?

score 15 · Answer 1 · answered Jul 14 '17 at 12:35

I have corrected Lisa's code. It works now!

    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    # `path` is the location of the PDF file; it must be opened in binary mode.
    fp = open(path, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument()
    # Link the parser and the document to each other, then initialize
    # with an empty password.
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    laparams.char_margin = 1.0
    laparams.word_margin = 1.0
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    extracted_text = ''

    # Process each page and collect the text of every text box or text line.
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                extracted_text += lt_obj.get_text()

    fp.close()
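
Since the question asks for the text line by line: once `extracted_text` has been built as above, one way to walk through it is `str.splitlines()`. This is only a minimal sketch, assuming the collected text separates lines with newline characters. The helper name `iter_text_lines` and the sample string are made up for illustration and stand in for a real extraction result.

```python
def iter_text_lines(extracted_text):
    """Yield the non-empty lines of the text, stripped of surrounding whitespace."""
    for line in extracted_text.splitlines():
        line = line.strip()
        if line:
            yield line


# Hypothetical sample standing in for the text collected from a PDF.
sample = "First line of the page\nSecond line \n\nThird line\n"

for number, line in enumerate(iter_text_lines(sample), start=1):
    print(number, line)
```

Skipping blank lines is a choice made here for readability; drop the `if line:` check if empty lines matter for your document.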

Could you add some description about the difference between your code and Lise's? — jhhoff02, Jul 14 '17 at 12:53
`extracted_text += string` is changed to `extracted_text += lt_obj.get_text()`. — Matphy, Jul 17 '17 at 07:31
[This answer](https://stackoverflow.com/a/56125023/1681480) has some corrections to the code above. I also deleted the `doc.set_parser` and `doc.initialize` lines to get it to work. — beroe, Aug 19 '19 at 08:31

score 2 · Answer 2 · answered May 18 '17 at 10:20

I am using Python 3.4, but I expect it works the same way with Python 3.5. Here is what I use:

    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    # file_content is the PDF opened as a binary file object, e.g. open(path, 'rb').
    parser = PDFParser(file_content)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    # I changed the following 2 parameters to get rid of white spaces inside words:
    laparams.char_margin = 1.0
    laparams.word_margin = 1.0
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    extracted_text = ''

    # Process each page contained in the document.
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                extracted_text += lt_obj.get_text()

    with open('convertedFile.txt', "wb") as txt_file:
        txt_file.write(extracted_text.encode("utf-8"))
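
The last two lines of the code above write the text as UTF-8 bytes. In Python 3 a roughly equivalent approach is to open the file in text mode with an explicit encoding, and reading the file back then gives the text one line at a time. The sketch below uses a hypothetical sample string and a temporary directory instead of a real PDF and output path, so treat the names as placeholders.

```python
import os
import tempfile

# Hypothetical sample standing in for the text extracted from a PDF.
extracted_text = "Première ligne\nSecond line\n"

# Write into a temporary directory so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "convertedFile.txt")

# Text mode with an explicit encoding replaces the manual .encode("utf-8").
with open(out_path, "w", encoding="utf-8") as txt_file:
    txt_file.write(extracted_text)

# Iterating over a text-mode file yields one line at a time.
with open(out_path, "r", encoding="utf-8") as txt_file:
    lines_read = [line.rstrip("\n") for line in txt_file]

print(lines_read)
```

Note that text mode may translate line endings on some platforms, which the binary write in the answer does not do.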

Replace "string" with "lt_obj.get_text()". — tulsluper, Mar 15 '18 at 08:07
