Extract text from PDF file generated by Chrome's print option using PyPDF2

Question

I am trying to extract text from PDF files using Python (v3.8.2) and the PyPDF2 module (v1.26.0). It works for every file except PDFs generated with Chrome's print option.

I have accumulated these files over time by using Chrome's print option, which can save a page or document as a PDF. I cannot extract text from them: the code returns only ' ' (empty), while other PDF files work fine. To reproduce, save any web page as a PDF with Chrome's print option and run the code on that file. Chrome version: 81.0.4044.138.

I found that Chrome uses Skia to save pages as PDF, but that did not help me solve the problem. (PDF Producer: Skia/PDF m80)

I found the following similar question on Stack Overflow, but nobody has answered it yet, and as a new user I can't comment or add anything, hence this new question.

Extract text from pdf converted from webpage using Pypdf2

Following is the code

import PyPDF2

# Open the PDF in binary mode; the with block closes the file automatically.
with open('example.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)  # first page of the document
    print(pageObj.extractText())

I am a new user and this is my first question, so please correct me if I have done anything incorrectly. I did search on Google, but either found no solution or lacked the knowledge to understand the problem and the solutions offered. Thank you.

Are you printing web pages? If so, you can scrape them with something like requests+beautifulsoup4 or selenium — bherbruck, May 13 '20 at 19:49
The PDFs I have were generated over time (years). Each web page is the result of filling in a form, and I saved the resulting page as a PDF. Scraping would require me to fill in thousands of forms, each with different data, and then extract from the web page — Vishal, May 14 '20 at 06:53

score 1 · Accepted Answer · answered May 13 '20 at 20:25

PyPDF2 is highly unreliable for extracting text from PDFs, as is also pointed out here, which says:

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

See my answer to a similar question here.
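As a rough illustration of the PDFMiner route suggested above, here is a minimal sketch. It assumes the maintained fork, pdfminer.six, is installed (pip install pdfminer.six) and uses its extract_text function from pdfminer.high_level. The helper names is_blank, extract_text_with_pdfminer and main are hypothetical, chosen for this example only, and 'example.pdf' is the file name from the question.

```python
from typing import Optional


def is_blank(text: Optional[str]) -> bool:
    """Return True when an extractor produced no usable text."""
    return text is None or not text.strip()


def extract_text_with_pdfminer(path: str) -> str:
    """Extract the text of every page of the PDF at `path`.

    Assumes pdfminer.six is installed; the import is done lazily so that
    is_blank() remains usable even without the package.
    """
    from pdfminer.high_level import extract_text

    return extract_text(path)


def main(path: str = "example.pdf") -> None:
    """Print the extracted text, or a hint when nothing was found."""
    text = extract_text_with_pdfminer(path)
    if is_blank(text):
        # An empty result can also mean the PDF holds only images.
        print("No text was extracted from", path)
    else:
        print(text)
```

Calling main("example.pdf") on one of the Chrome-generated files should show whether PDFMiner recovers the text that extractText() missed. The is_blank check is there because an extractor may return whitespace rather than a truly empty string, as the ' ' result in the question suggests.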

Thank you. PDFMiner works. If anyone wants a tutorial for PDFMiner, I found the following very helpful: https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/ — Vishal, May 14 '20 at 07:27
