I have thousands of PDF files like this one.
I'm trying to use PyPDF2 to convert them to plain text (code is below). But PyPDF2 apparently only "sees" the watermarks, not the content itself. What could I do here?

import os
import PyPDF2

path_to_pdfs = '/path/to/pdf/files/'

for filename in os.listdir(path_to_pdfs):
    # Only process files whose name ends in .pdf (case-insensitive)
    if filename.lower().endswith('.pdf'):
        with open(os.path.join(path_to_pdfs, filename), mode='rb') as f:
            txt = ''
            pdf_reader = PyPDF2.PdfFileReader(f)
            num_pages = pdf_reader.numPages
            # Concatenate the extracted text of every page
            for page in range(num_pages):
                page_obj = pdf_reader.getPage(page)
                page_text = page_obj.extractText()
                txt = txt + '\n' + page_text
            print(txt)

I'm using Python 3.5.1 and PyPDF2 1.26.0 on macOS 10.13.14.
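
In case it helps narrow things down, here is a rough sketch of how one might check that the extracted text really is just the watermark. It assumes the watermark text repeats verbatim on every page, which may not hold for every file. `lines_on_every_page` and `looks_like_watermark_only` are hypothetical helper names, not part of PyPDF2, and `page_texts` stands for the list of per-page strings returned by `extractText()` in the loop above.

```python
from collections import Counter


def lines_on_every_page(page_texts):
    """Return the set of non-empty lines that occur on every page."""
    counts = Counter()
    for text in page_texts:
        # Count each distinct line at most once per page.
        unique_lines = {line.strip() for line in text.splitlines() if line.strip()}
        counts.update(unique_lines)
    return {line for line, n in counts.items() if n == len(page_texts)}


def looks_like_watermark_only(page_texts):
    """Return True if no page has any line beyond those shared by all pages.

    Assumption: a watermark shows up as identical lines on every page.
    A single-page document trivially satisfies this, so the check is only
    meaningful for documents with several pages.
    """
    if not page_texts:
        return False
    shared = lines_on_every_page(page_texts)
    for text in page_texts:
        lines = {line.strip() for line in text.splitlines() if line.strip()}
        if lines - shared:
            # This page has text that is not repeated everywhere.
            return False
    return bool(shared)
```

Running `looks_like_watermark_only` on the per-page text of a multi-page file should show whether any page yields text beyond the repeated lines.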