Extract text from PDF url with io and PyPDF2 gives no output

Question

I'm trying to extract the text from a PDF at a URL. If I download the PDF, I can easily extract the text with slate. However, when I load the PDF into memory with io and try to extract the text, the output is empty. The code is below.

import io

import requests
import PyPDF2

url = 'https://www.poderjudicial.es/search/contenidos.action?action=accessToPDF&publicinterface=true&tab=AN&reference=e3ca421447bc6b71&encode=true&optimize=20210216&databasematch=AN'

response = requests.get(url)
f = io.BytesIO(response.content)

with f as data:
    read_pdf = PyPDF2.PdfFileReader(data)
    page = read_pdf.getPage(1)
    print(page.extractText())

I have tried a bunch of other functions, but none of them work. Am I doing something wrong?

The first ten bytes of `response.content` are `b'%PDF-1.4\n%'`, so the program does seem to be receiving a valid PDF file. Did you try printing the attribute `read_pdf.numPages`? - VirtualScooter, Feb 27 '21 at 20:01
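
The header check described in this comment can be sketched as a small helper. This is only an illustration: `pdf_header_info` is a hypothetical name, not part of PyPDF2 or requests, and it only looks at the first bytes, so it says nothing about whether the rest of the file parses.

```python
def pdf_header_info(data: bytes) -> dict:
    """Return basic facts about bytes that are supposed to be a PDF.

    Hypothetical helper for debugging; it inspects only the header.
    """
    is_pdf = data.startswith(b"%PDF-")
    version = None
    if is_pdf:
        # The version follows the "%PDF-" marker on the first line,
        # e.g. b"%PDF-1.4" gives "1.4".
        first_line = data[:16].splitlines()[0]
        version = first_line[5:].decode("ascii", errors="replace").strip()
    return {
        "header": data[:10],
        "is_pdf": is_pdf,
        "version": version,
        "size": len(data),
    }


# Example with the ten header bytes quoted above (not a complete PDF).
print(pdf_header_info(b"%PDF-1.4\n%"))
```

In the question's script, this would be called as `pdf_header_info(response.content)` before handing the bytes to PyPDF2. Also worth keeping in mind when checking `numPages`: PyPDF2 page indices start at 0, so `getPage(1)` is the second page of the document.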

score 0 · Answer 1 · answered Feb 27 '21 at 21:22

I get blank output as well, and I am not sure why. Have you tried pdfminer3? The following code gives me the proper text output for this file.

import io

import requests
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

url = 'https://www.poderjudicial.es/search/contenidos.action?action=accessToPDF&publicinterface=true&tab=AN&reference=e3ca421447bc6b71&encode=true&optimize=20210216&databasematch=AN'

response = requests.get(url)
f = io.BytesIO(response.content)

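with f as fh:
    # Feed every page through the interpreter; the converter writes the
    # extracted text into fake_file_handle.
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()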

# close open handles
converter.close()
fake_file_handle.close()

print(text)

See also this post: How to use PDFminer.six with python 3?
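
For comparison, here is a shorter sketch that uses pdfminer.six instead of pdfminer3. It assumes pdfminer.six is installed (it is imported as `pdfminer`) and relies on its `pdfminer.high_level.extract_text` function, which accepts a file-like object. The function name `extract_text_from_pdf_bytes` is made up for this example, and I have not run it against the PDF from the question.

```python
import io


def extract_text_from_pdf_bytes(data: bytes) -> str:
    """Extract text from in-memory PDF bytes using pdfminer.six.

    Hypothetical helper; assumes the pdfminer.six package is installed.
    """
    if not data.startswith(b"%PDF-"):
        # e.g. the server returned an HTML error page instead of a PDF
        raise ValueError("data does not start with a PDF header")

    # Imported here so the header check above still works when
    # pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    return extract_text(io.BytesIO(data))
```

With the `response` object from the code above, the call would be `text = extract_text_from_pdf_bytes(response.content)`. The header check is there because a blank or failed extraction is easier to debug once you know the download really is a PDF.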
