49

How can I read a PDF in Python? I know one way, which is converting it to text first, but I want to read the content directly from the PDF.

Can anyone explain which Python module is best for PDF extraction?

peterh
sg1994

2 Answers

73

You can use the PyPDF2 package:

# install PyPDF2
pip install PyPDF2

Once you have it installed:

# importing all the required modules
import PyPDF2

# creating a pdf reader object
reader = PyPDF2.PdfReader('example.pdf')

# print the number of pages in pdf file
print(len(reader.pages))

# print the text of the first page
print(reader.pages[0].extract_text())
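To get the text of the whole document rather than only the first page, the same calls can be looped over every page. The following is a minimal sketch, not part of the original answer: `extract_all_text` and `read_pdf_text` are hypothetical helper names, and the sketch assumes that a page with no extractable text (for example a scanned image) may return an empty or missing result.

```python
def extract_all_text(reader, separator="\n"):
    """Join the text of every page of a PdfReader-like object.

    `reader` only needs a `.pages` sequence whose items have
    an `extract_text()` method, as PyPDF2.PdfReader does.
    """
    parts = []
    for page in reader.pages:
        # Assumption: image-only pages may give no text, so fall back to ""
        parts.append(page.extract_text() or "")
    return separator.join(parts)


def read_pdf_text(path):
    """Hypothetical convenience wrapper: open `path` and return all its text."""
    # Imported here so extract_all_text stays usable without PyPDF2 installed
    import PyPDF2

    return extract_all_text(PyPDF2.PdfReader(path))
```

With PyPDF2 installed, `print(read_pdf_text('example.pdf'))` would print the text of all pages, separated by newlines.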

See the PyPDF2 documentation for more details.

Martin Thoma
shankarj67
13

You can use the textract module in Python.

Textract

To install it:

pip install textract

To read a PDF:

import textract
# process() returns bytes, so decode to get a str
text = textract.process('path/to/pdf/file', method='pdfminer').decode('utf-8')

For details, see the Textract documentation.

Community
Kallz