1

I am trying to extract the content from a PDF in order to create an Excel sheet from it.

What I tried

import pdfquery 
pdf = pdfquery.PDFQuery('C:\\Users\\Santosh\\Downloads\\2017-San-Jamar-Price-List-US-Z120913E-RevA.pdf')
page = pdf.get_page(3)
page_content = page.extractText()
print (page_content)

It throws the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-32-d6b615faa422> in <module>() 
      1 page = pdf.get_page(3)
----> 2 page_content = page.extractText()
      3 print (page_content)

AttributeError: 'PDFPage' object has no attribute 'extractText'

Please let me know a possible solution.

Martin Thoma
Santosh

3 Answers

2

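Use PyPDF2 instead of pdfquery. As the traceback shows, the PDFPage object returned by pdfquery's get_page has no extractText method, whereas a PyPDF2 page provides extract_text():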

from PyPDF2 import PdfReader

reader = PdfReader('C:\\Users\\Santosh\\Downloads\\2017-San-Jamar-Price-List-US-Z120913E-RevA.pdf')
page = reader.pages[3]
print(page.extract_text())
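
Since the goal in the question is a spreadsheet, here is a minimal sketch of turning the extracted text into a CSV that Excel can open. It uses only the standard library. The helper names `text_to_rows` and `rows_to_csv` are hypothetical, the sample string is made up for illustration, and the rule "two or more spaces separate columns" is an assumption that may not match how this particular PDF is laid out.

```python
import csv
import io
import re


def text_to_rows(text):
    """Split extracted page text into rows of cells.

    Assumption: columns are separated by runs of two or more
    whitespace characters. Real PDFs often need a different rule.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text that Excel can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample text standing in for page.extract_text()
sample = "Item   Description   Price\nX1   Example product   9.99\n"
print(rows_to_csv(text_to_rows(sample)))
```

In practice you would pass the string returned by `page.extract_text()` to `text_to_rows` and write the result of `rows_to_csv` to a `.csv` file.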
Martin Thoma
Tejas Mankar
1

I also faced the same issue. It is caused by an outdated version of the PyPDF2 package that was already installed as a dependency of other PDF reader packages. Reinstalling PyPDF2 resolved the error.

pip uninstall pypdf2
pip install pypdf2

This worked for me
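
If you cannot control which PyPDF2 version is installed, a small version-tolerant helper may also help. This is only a sketch: `extract_page_text` is a hypothetical name, and it assumes that older PyPDF2 releases expose `extractText()` while newer ones expose `extract_text()`. It will not make a pdfquery page work, because that object has neither method, but it does raise a clearer error in that case.

```python
def extract_page_text(page):
    """Return the text of a page object, whichever method name it offers.

    Assumption: newer PyPDF2 pages have extract_text(), older ones
    have extractText(). Any other object raises AttributeError.
    """
    for name in ("extract_text", "extractText"):
        method = getattr(page, name, None)
        if callable(method):
            return method()
    raise AttributeError(
        f"{type(page).__name__!r} object has neither extract_text() "
        "nor extractText(); is it a PyPDF2 page?"
    )
```

You would call it as `extract_page_text(reader.pages[3])` in place of calling the method directly.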

Berlin Benilo
0

I uninstalled PyPDF and PyPDF2, then reinstalled PyPDF2, and the issue was resolved.

pip uninstall PyPDF
pip uninstall PyPDF2
pip install PyPDF2
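
Before and after reinstalling, it can be useful to check which PDF packages are actually present in the environment. The sketch below uses the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` and the list of package names checked are my own choices for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names here are just the ones discussed in this thread
for name in ("PyPDF2", "pypdf", "pdfquery"):
    print(name, installed_version(name))
```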