I want to extract the title from each page of a PDF, but the titles in my PDFs do not have a consistent or predefined font size (the title size varies from page to page). I tried the following code, but it does not give the expected output; instead, it extracts the whole text of each page:
import PyPDF2

filenames = ['Test2.pdf']
# filenames = ['sample-pdf-download-10-mb.pdf', 'sample-pdf-file.pdf', 'sample-pdf-with-images.pdf']

for filename in filenames:
    # Open each PDF in binary mode; the file is closed when the block ends
    with open(filename, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        num_pages = pdfReader.numPages
        count = 0
        text = ""
        while count < num_pages:
            pageObj = pdfReader.getPage(count)
            count += 1
            page_text = pageObj.extractText()
            text += page_text
            # str.title() only title-cases the string, so this prints the whole page text
            print(count, "= ", page_text.title())
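To make the goal concrete: since the title size varies between pages, the heuristic I have in mind is to treat whatever text is set in the largest font on a given page as that page's title. Below is a minimal sketch of that selection step only. It assumes some extraction library can give me (text, font size) pairs for a page; extractText() returns a plain string with no font information, so the `spans` input and the `pick_title` helper are hypothetical and not part of PyPDF2.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical input: (text, font_size) pairs for one page, in reading order.
Span = Tuple[str, float]


def pick_title(spans: Iterable[Span], tolerance: float = 0.1) -> Optional[str]:
    """Return the text set in the largest font on the page, or None if the page is empty."""
    # Drop whitespace-only spans so they cannot be chosen as the title
    cleaned = [(text.strip(), size) for text, size in spans if text.strip()]
    if not cleaned:
        return None
    largest = max(size for _, size in cleaned)
    # Keep every span within `tolerance` of the largest size, so a title
    # that was split into several spans is joined back together
    parts = [text for text, size in cleaned if largest - size <= tolerance]
    return " ".join(parts)


# Made-up example data for a single page
page_spans = [
    ("Quarterly Report", 24.0),
    ("Prepared by the finance team", 11.0),
    ("Revenue grew in every region.", 11.0),
]
print(pick_title(page_spans))
```

Because the comparison is relative to the largest size on each page, no fixed title size is needed. It would still pick the wrong text on pages where something other than the title (a pull quote, for example) uses the largest font.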
Also, how can I extract highlighted text from a PDF?