I'm trying to scrape the text from the PDF file at https://www.blackhawk.edu/Portals/0/Public%20PDFs/2019-20/Blackhawk-Staged-Reopening-Plan-2.pdf?ver=2020-07-09-171645-080. I tried the following code, but it did not return the text I need.

import requests
import PyPDF2

url = "https://www.blackhawk.edu/Portals/0/Public%20PDFs/2019-20/Blackhawk-Staged-Reopening-Plan-2.pdf?ver=2020-07-09-171645-080"
pdf = requests.get(url).content

# Save the downloaded bytes to a local file
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(pdf)

open_pdf_file = open("my_pdf.pdf", 'rb')
read_pdf = PyPDF2.PdfFileReader(open_pdf_file)

n = read_pdf.getNumPages()

temp = read_pdf.flattenedPages  # list of page objects

# Extract the text of every page and join it into one string
temp2 = [d.extractText() for d in temp]
temp2 = "".join(temp2)
temp2 = ext_context(temp2, type="pdf")  # ext_context is not defined in this snippet

print(temp2)

Only some empty circles were extracted, not the text I need. I am new to Python. Any help is appreciated. Thank you in advance for your time.

Yen
  • Does this answer your question? [How to extract text from a PDF file?](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) – PApostol Jan 06 '21 at 20:13
  • Looks like you have a non-searchable PDF, i.e., the text is not stored as text but as an image. See https://www.quora.com/What-is-the-best-way-to-make-a-searchable-PDF-out-of-a-non-searchable-PDF-or-picture-file – Jan 06 '21 at 20:17
  • Indeed, the "text" in this PDF consists of bitmap images; only the list item circles are real text. To scrape the text nonetheless, you need to apply OCR. – mkl Jan 07 '21 at 09:25
  • Thanks to Justin and mkl for the insights. That makes perfect sense. I will think about how to deal with the problem. Any thoughts or suggestions are appreciated. Thank you again for your help. – Yen Jan 07 '21 at 14:57
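
The OCR route suggested in the comments could look roughly like the sketch below. It is not from the original thread: the function names `needs_ocr` and `ocr_pdf`, the 20-character threshold, and the 300 dpi setting are all illustrative assumptions. It also assumes the third-party packages `pdf2image` and `pytesseract` are installed along with the Poppler and Tesseract programs they rely on, and OCR quality will depend on the scan.

```python
def needs_ocr(page_texts, min_alnum_chars=20):
    """Heuristic: True if the extracted text has too few letters or digits.

    A PDF whose pages are bitmap images typically yields only stray symbols
    (such as bullet circles). The threshold is an arbitrary assumption.
    """
    alnum = sum(ch.isalnum() for text in page_texts for ch in text)
    return alnum < min_alnum_chars


def ocr_pdf(pdf_path, dpi=300):
    """Render each page to an image and run Tesseract OCR on it.

    Assumes pdf2image and pytesseract are installed, together with the
    Poppler and Tesseract binaries they wrap.
    """
    # Imported lazily so needs_ocr() can be used without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

One way to use it: pass the list of per-page strings returned by `extractText()` to `needs_ocr`, and call `ocr_pdf("my_pdf.pdf")` only when it returns `True`, since OCR is much slower than reading embedded text.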
