I am using Python 3.4 and need to extract all the text from a PDF and then use it for text processing.
All the answers I have seen suggest options for Python 2.7.
I need something in Python 3.4.
Bonson
You need to install the pypdf package to work with PDFs in Python. pypdf can extract text and images, and the text is returned as a Python string. To install it, run pip install pypdf from the command line. The module name is case-sensitive when you import it, so make sure to type it in all lowercase.
from pypdf import PdfReader

reader = PdfReader('my_file.pdf')
print(len(reader.pages))  # number of pages, e.g. 56
page = reader.pages[9]  # index 9 is the tenth page, since indexing starts at 0
page.extract_text()
The last statement returns all the text that is available on the page at index 9 (the tenth page) of the 'my_file.pdf' document.
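If you need the text of the whole document rather than one page, one option is to loop over reader.pages and join the results. This is only a minimal sketch, not part of the answer above: join_page_text and pdf_to_string are hypothetical helper names, and it assumes pypdf is installed and that the path you pass in exists.

```python
def join_page_text(pages, separator="\n\n"):
    """Join the text of every page-like object into one string."""
    chunks = []
    for page in pages:
        # Fall back to an empty string if a page yields no text
        # (for example, a scanned page that contains only an image).
        text = page.extract_text() or ""
        chunks.append(text.strip())
    return separator.join(chunks)


def pdf_to_string(path):
    """Hypothetical wrapper: read a PDF with pypdf and return all of its text."""
    # Imported here so join_page_text can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return join_page_text(reader.pages)
```

Calling pdf_to_string('my_file.pdf') should then give you a single string that you can pass on to your text processing.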
pdfminer.six ( https://github.com/pdfminer/pdfminer.six ) has also been recommended elsewhere and is intended to support Python 3. I can't personally vouch for it, though, since it failed during installation on macOS. (There's an open issue for that and it seems to be a recent problem, so there might be a quick fix.)
Complementing @Sarah's answer: PDFMiner is a pretty good choice. I have been using it for quite some time, and so far it has worked well for extracting the text content from a PDF. I created a function that calls the CLI client from pdfminer and captures its output in a variable (which I can use later on somewhere else). The Python version I am using is 3.6, and the function does the required job, so maybe this can work for you:
import subprocess

def pdf_to_text(filepath):
    print('Getting text content for {}...'.format(filepath))
    # stderr is redirected into stdout, so the stderr value returned below is always None
    process = subprocess.Popen(['pdf2txt.py', filepath], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    stdout, stderr = process.communicate()
    if process.returncode != 0 or stderr:
        raise OSError('Executing the command for {} caused an error:\nCode: {}\nOutput: {}\nError: {}'.format(filepath, process.returncode, stdout, stderr))
    return stdout.decode('utf-8')
The import subprocess line at the top is required for the function to work.
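A variation on the same idea, in case you want error messages kept apart from the extracted text: capture stdout and stderr separately with subprocess.run. This is only a sketch under a few assumptions: run_command is a hypothetical helper name, subprocess.run needs Python 3.5 or newer, and pdf2txt.py (the script shipped with pdfminer) has to be on your PATH.

```python
import subprocess


def run_command(command):
    """Run a command and return its decoded stdout, raising OSError on failure."""
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        # stderr is captured separately, so it can be reported on its own
        raise OSError('Command {} failed with code {}:\n{}'.format(
            command, result.returncode, result.stderr.decode('utf-8', errors='replace')))
    return result.stdout.decode('utf-8')


def pdf_to_text(filepath, tool='pdf2txt.py'):
    """Extract text by calling the pdfminer command-line script (assumed to be on PATH)."""
    return run_command([tool, filepath])
```

Keeping the command runner separate from pdf_to_text makes it easy to try the error handling with any other command before pointing it at a real PDF.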
slate3k is good for extracting text. I've tested it with a few PDF files using Python 3.7.3, and it's a lot more accurate than PyPDF2, for instance. It's a fork of slate, which is a wrapper for PDFMiner. Here's the code I am using:
import slate3k as slate

with open('Sample.pdf', 'rb') as f:
    doc = slate.PDF(f)

doc
# shows the full document as a list of strings
# each element of the list is a page in the document
doc[0]
# shows the first page of the document
Credit to this comment on GitHub: https://github.com/mstamy2/PyPDF2/issues/437#issuecomment-400491342
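Since the document comes back as a list of page strings, you will probably want to tidy it up before any text processing. Here is a rough sketch of one way to do that. It is an assumption about what "text processing" needs, not something slate3k provides, and pages_to_text is a hypothetical helper name.

```python
import re


def pages_to_text(pages):
    """Collapse runs of whitespace on each page and join the pages with newlines."""
    cleaned = []
    for page in pages:
        # Extracted PDF text often contains stray newlines, tabs and form feeds.
        cleaned.append(re.sub(r"\s+", " ", page).strip())
    return "\n".join(cleaned)
```

With the code above you would call pages_to_text(doc) to get one cleaned string, one line per page.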
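from pdfreader import SimplePDFViewer

pdfFileObj = open('/tmp/Test-test-test.pdf', 'rb')
viewer = SimplePDFViewer(pdfFileObj)
viewer.navigate(1)  # go to page 1
viewer.render()  # parse the content of the current page
viewer.canvas.strings  # the text strings found on that page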