
I am using PyPDF2 to extract text from a PDF. All the examples I found on Google look like my code:

import PyPDF2

reader = PyPDF2.PdfFileReader("test2.pdf")
page = reader.getPage(0)
text = page.extractText()
print(text.encode("utf-8"))

However, the text printed to my console is empty:

b''

I have tested this code on several different PDFs, and the extracted text was empty for all of them.

UPD:

# getDocumentInfo
{'/Producer': 'Skia/PDF m75'}

File pdf

Martin Thoma
nesalexy
  • Do your PDF files really contain text, or are they some kind of scanned image? Could you attach a sample file here? – kosist Apr 10 '19 at 08:51
  • Refer to this answer, it might help: https://stackoverflow.com/a/51328080/6234311 – Reda Drissi Apr 10 '19 at 08:54
  • @kosist Yes, the PDF really contains text. I have added a PDF file. – nesalexy Apr 10 '19 at 08:58
  • Did you try other libraries? Even on GitHub it is written that this library sometimes does not extract text properly... – kosist Apr 10 '19 at 10:07
  • If it helps anyone debugging, I get the same issue for PDFs with producer "Skia/PDF m86" – timhj Nov 02 '20 at 05:17

1 Answer


It looks like some font/text combos make the text unreadable by PyPDF2, PyPDF3 or PyPDF4.

To extract the text from these PDFs, you can use the dedicated PDF text extraction package pdfminer.six.

from pdfminer import high_level

local_pdf_filename = "/path/to/pdf/you_want_to_extract_text_from.pdf"
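pages = [0]  # zero-based page numbers; just the first page

# Pass page_numbers by keyword so the empty password argument can be omitted
extracted_text = high_level.extract_text(local_pdf_filename, page_numbers=pages)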
print(extracted_text)

It worked on all the PDFs that were failing for me and is quick to implement as a fallback. Full documentation for the extract_text function is in the pdfminer.six docs.
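If you want to keep PyPDF2 as the first attempt, the fallback could be wired up roughly like this. It is only a sketch: the helper names (extract_with_pypdf2, extract_with_pdfminer, extract_text_with_fallback) are made up for illustration, and I am assuming that an empty or whitespace-only result means the extraction failed. The PyPDF2 calls are the ones from the question, so they target the old PyPDF2 API.

```python
from typing import Callable, Optional, Sequence


def extract_with_pypdf2(path: str, page_number: int = 0) -> str:
    """First attempt: the same PyPDF2 calls used in the question."""
    import PyPDF2  # imported lazily so this module loads without PyPDF2

    reader = PyPDF2.PdfFileReader(path)
    return reader.getPage(page_number).extractText()


def extract_with_pdfminer(path: str, page_number: int = 0) -> str:
    """Fallback: pdfminer.six high-level API, as in the answer."""
    from pdfminer import high_level  # imported lazily, same reason

    return high_level.extract_text(path, page_numbers=[page_number])


def extract_text_with_fallback(
    path: str,
    page_number: int = 0,
    extractors: Optional[Sequence[Callable[[str, int], str]]] = None,
) -> str:
    """Try each extractor in order and return the first non-empty text."""
    if extractors is None:
        extractors = (extract_with_pypdf2, extract_with_pdfminer)
    for extractor in extractors:
        try:
            text = extractor(path, page_number)
        except Exception:
            # Assumption: any failure (missing library, unsupported PDF)
            # just means "try the next extractor".
            continue
        if text and text.strip():
            return text
    return ""  # every extractor failed or returned only whitespace
```

Calling extract_text_with_fallback("test2.pdf") would then only reach pdfminer.six when PyPDF2 returns nothing. The broad except is deliberate for a quick fallback, but it also hides real errors such as a wrong file path, so you may want to narrow it.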

timhj