How can I read pdf in python?

Question

How can I read a PDF in Python? I know one way is to convert it to text, but I want to read the content directly from the PDF.

Can anyone explain which Python module is best for PDF extraction?

score 73 · Answer 1 · edited Dec 20 '22 at 18:04

You can use the PyPDF2 package:

# install PyPDF2
pip install PyPDF2

Once you have it installed:

# importing all the required modules
import PyPDF2

# creating a pdf reader object
reader = PyPDF2.PdfReader('example.pdf')

# print the number of pages in pdf file
print(len(reader.pages))

# print the text of the first page
print(reader.pages[0].extract_text())

See the PyPDF2 documentation for more details.
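A minimal sketch extending the snippet above to every page, and flagging pages where no text could be extracted (for example, scanned or image-only pages). The helper names `extract_pages`, `empty_page_numbers` and `read_pdf` are made up for illustration. It assumes the package is installed under its newer name `pypdf`; with PyPDF2 the import would be `from PyPDF2 import PdfReader` instead.

```python
def extract_pages(reader):
    """Return a list with the extracted text of every page.

    `reader` is any object exposing `.pages`, where each page has an
    `extract_text()` method (as PdfReader objects do).
    """
    texts = []
    for page in reader.pages:
        # Fall back to an empty string if a page yields nothing.
        texts.append(page.extract_text() or "")
    return texts


def empty_page_numbers(texts):
    """Return the 1-based numbers of pages with no extractable text."""
    return [number for number, text in enumerate(texts, start=1)
            if not text.strip()]


def read_pdf(path):
    """Hypothetical convenience wrapper around the two helpers above."""
    # Assumption: pypdf is installed (pip install pypdf). Imported here
    # so the helpers above can be used without the package.
    from pypdf import PdfReader
    return extract_pages(PdfReader(path))
```

Usage would look like `texts = read_pdf('example.pdf')` followed by `print(empty_page_numbers(texts))`. If every page is reported as empty, the PDF probably has no text layer and would need OCR rather than text extraction.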

edited Dec 20 '22 at 18:04 by Martin Thoma

answered Aug 21 '17 at 10:56 by shankarj67

Tried using this package with an order form from Amazon. It found 33 pages but extractText() API was empty for all pages – retsigam Jun 15 '21 at 21:35
@Sanket PyPDF2 was improved a lot in 2022. Give it another shot :-) – Martin Thoma Dec 20 '22 at 18:05
@retsigam PyPDF2 was improved a lot in 2022. Please try it again – Martin Thoma Dec 20 '22 at 18:07
The library went back to its first name `pypdf` – Zack Walton Feb 17 '23 at 15:13

score 13 · Answer 2 · edited Jun 20 '20 at 09:12

You can use the textract module in Python.

To install it:

pip install textract

To read a PDF:

import textract
text = textract.process('path/to/pdf/file', method='pdfminer')

For details, see the Textract documentation.
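One detail about the snippet above: as far as I know, `textract.process` returns a byte string rather than a `str`, so the result usually needs decoding before use. A minimal sketch, assuming UTF-8 output; the helper names `to_text` and `read_pdf_with_textract` are hypothetical, not part of textract.

```python
def to_text(raw, encoding="utf-8"):
    """Decode extractor output to str; pass str through unchanged."""
    if isinstance(raw, bytes):
        # errors="replace" avoids an exception on undecodable bytes
        return raw.decode(encoding, errors="replace")
    return raw


def read_pdf_with_textract(path):
    """Hypothetical wrapper: extract text with textract and decode it."""
    # Assumption: textract is installed and working in your environment.
    import textract
    return to_text(textract.process(path, method="pdfminer"))
```

With this, `read_pdf_with_textract('path/to/pdf/file')` would give a regular string instead of bytes.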

edited Jun 20 '20 at 09:12 by Community

answered Aug 21 '17 at 10:49 by Kallz

textract is broken as far as I can tell. – conner.xyz May 14 '18 at 16:58
Textract seems to be dead as well: https://github.com/deanmalmgren/textract/issues/350 – Martin Thoma Aug 21 '20 at 07:18
Update 2023: Still not working to read PDF files. – user3503711 Jul 13 '23 at 14:40
