I am using Python 3.5 and want to read text, line by line, from PDF files. I tried pdfminer3k, but I could not find a clear usage example anywhere.
How do I use it correctly?
12

smci

poshita singh
2 Answers
15
I have corrected Lise's code. It works now!
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    # path is the location of the PDF file; open it in binary mode
    fp = open(path, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    laparams.char_margin = 1.0
    laparams.word_margin = 1.0
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    extracted_text = ''
    # Process each page and collect the text of every text box or text line
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                extracted_text += lt_obj.get_text()
    fp.close()
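To get the result line by line, as the question asks, one option is to split `extracted_text` once the loop has finished. This is only a minimal sketch: it assumes `extracted_text` is the string built above, and `text_to_lines` is a hypothetical helper name, not part of pdfminer3k. The sample string stands in for real extracted text.

```python
def text_to_lines(extracted_text):
    """Split extracted text into stripped, non-empty lines."""
    lines = []
    for raw_line in extracted_text.splitlines():
        line = raw_line.strip()
        if line:  # skip blank lines left between text boxes
            lines.append(line)
    return lines


# Stand-in for the text collected from the PDF (assumption for illustration)
sample = "First line \nSecond line\n\nThird line\n"
for number, line in enumerate(text_to_lines(sample), start=1):
    print(number, line)
```

Whether the lines match the visual lines on the page depends on how the layout analysis grouped the text, so the `LAParams` values may need tuning for a given file.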

Matphy
- Could you add some description about the difference between your code and Lise's? – jhhoff02 Jul 14 '17 at 12:53
- `extracted_text += string` is changed to `extracted_text += lt_obj.get_text()`. – Matphy Jul 17 '17 at 07:31
- [This answer](https://stackoverflow.com/a/56125023/1681480) has some corrections to the code above. I also deleted the `doc.set_parser` and `doc.initialize` lines to get it to work. – beroe Aug 19 '19 at 08:31
2
I am using Python 3.4, but I guess it works the same way with Python 3.5. Here is what I use:
    from pdfminer.pdfparser import PDFParser, PDFDocument
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    # file_content is the PDF opened as a binary file object
    parser = PDFParser(file_content)
    doc = PDFDocument()
    parser.set_document(doc)
    doc.set_parser(parser)
    doc.initialize('')

    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    # I changed the following 2 parameters to get rid of white spaces inside words:
    laparams.char_margin = 1.0
    laparams.word_margin = 1.0
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    extracted_text = ''
    # Process each page contained in the document.
    for page in doc.get_pages():
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                # `string` is not defined anywhere in this snippet;
                # lt_obj.get_text() returns the text of the layout object
                extracted_text += string

    # Write the collected text to a UTF-8 encoded file
    with open('convertedFile.txt', "wb") as txt_file:
        txt_file.write(extracted_text.encode("utf-8"))

Lise