I'm trying to extract the text from a PDF served at a URL. If I download the PDF first, I can easily extract the text with the slate library. However, when I load the PDF into memory with io and try to extract the text, the output is empty. The code is attached below.

import io

import PyPDF2
import requests

url = 'https://www.poderjudicial.es/search/contenidos.action?action=accessToPDF&publicinterface=true&tab=AN&reference=e3ca421447bc6b71&encode=true&optimize=20210216&databasematch=AN'

response = requests.get(url)
f = io.BytesIO(response.content)

with f as data:
    read_pdf = PyPDF2.PdfFileReader(data)
    # getPage() is zero-based, so index 1 is the second page of the document
    page = read_pdf.getPage(1)
    print(page.extractText())

I have tried a number of other functions, but none of them work. Am I doing something wrong?

martineau
Ana
  • The first ten bytes of `response.content` are `b'%PDF-1.4\n%'`, so the program does seem to receive a valid PDF file. Did you try printing the attribute `read_pdf.numPages`? – VirtualScooter Feb 27 '21 at 20:01
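
The header check described in the comment above can be sketched as a small helper. This is only an illustration: the function name `looks_like_pdf` is hypothetical, and it tests nothing beyond the leading `%PDF-` marker, so a payload that passes may still be damaged or contain no extractable text.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    # PDF files begin with a header such as b"%PDF-1.4"
    return data.startswith(b"%PDF-")


# Usage, assuming `response` is the requests.Response from the question:
#     print(response.content[:10])
#     print(looks_like_pdf(response.content))
```

If this returns False, the server most likely sent something other than the PDF (an HTML error page, for instance), and the text extractor is not the problem.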

1 Answer

I get the blank output as well, and I am not sure why. Have you tried pdfminer3? The following code gives me the proper text output for this file.

import io

import requests
from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

url = 'https://www.poderjudicial.es/search/contenidos.action?action=accessToPDF&publicinterface=true&tab=AN&reference=e3ca421447bc6b71&encode=true&optimize=20210216&databasematch=AN'

response = requests.get(url)
f = io.BytesIO(response.content)

with f as fh:
    # Run every page through the interpreter; the converter writes the text into fake_file_handle
    for page in PDFPage.get_pages(fh,
                                  caching=True,
                                  check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()

# close open handles
converter.close()
fake_file_handle.close()

print(text)

See also this post: How to use PDFminer.six with python 3?
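
Since the linked post concerns pdfminer.six, here is a minimal sketch using that library's high-level `extract_text` helper instead of the manual interpreter setup. It assumes pdfminer.six is installed; the wrapper name `pdf_bytes_to_text` is hypothetical, and I have not verified the output against the PDF in the question.

```python
import io


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract text from an in-memory PDF using pdfminer.six's high-level API."""
    if not data:
        raise ValueError("empty payload: nothing to extract")

    # Imported lazily so this module still loads where pdfminer.six is not installed
    from pdfminer.high_level import extract_text

    # extract_text accepts a file path or a binary file-like object
    return extract_text(io.BytesIO(data))


# Usage, assuming `response` is the requests.Response from the code above:
#     print(pdf_bytes_to_text(response.content))
```

This is shorter because `extract_text` handles the resource manager, converter, and page loop internally.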