How can I read a PDF in Python? I know one way is to convert it to text first, but I want to read the content directly from the PDF.
Can anyone explain which Python module is best for PDF extraction?
You can use the PyPDF2 package.
# install PyPDF2
pip install PyPDF2
Once you have it installed:
# importing all the required modules
import PyPDF2
# creating a pdf reader object
reader = PyPDF2.PdfReader('example.pdf')
# print the number of pages in pdf file
print(len(reader.pages))
# print the text of the first page
print(reader.pages[0].extract_text())
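If you need the text of the whole document rather than just the first page, something like the following might work. It is only a sketch: `extract_all_text` and `read_pdf_text` are hypothetical helper names, not part of PyPDF2, and it assumes a PyPDF2 version that provides `PdfReader`, as in the snippet above.

```python
def extract_all_text(reader):
    """Join the extracted text of every page in a PdfReader-like object."""
    # Pages without a text layer (e.g. scanned images) may yield an empty
    # result, so fall back to "" before joining.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def read_pdf_text(path):
    """Open the PDF at `path` with PyPDF2 and return all of its text."""
    import PyPDF2  # imported here so extract_all_text works on its own
    return extract_all_text(PyPDF2.PdfReader(path))
```

Usage would be `print(read_pdf_text('example.pdf'))`. Keep in mind that text extraction quality depends on how the PDF was produced; scanned documents usually need OCR instead.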
See the PyPDF2 documentation for more details.
You can use the textract module in Python.
To install it:
pip install textract
To read a PDF:
import textract
text = textract.process('path/to/pdf/file', method='pdfminer')
# process() returns bytes; decode to get a str
text = text.decode('utf-8')
For details, see the Textract documentation.