How to extract title of each page from the PDF using Python

Asked Jul 13 '22 at 06:13

Active Jul 13 '22 at 07:07

Viewed 469 times

I want to extract the title of each page of a PDF, but my PDFs do not have a similar or predefined title size (the title size varies from page to page). I tried the following code, but it does not give the expected output; instead, it extracts the whole text of each page:

from PyPDF2 import PdfReader

filenames = ['Test2.pdf']
# filenames = ['sample-pdf-download-10-mb.pdf', 'sample-pdf-file.pdf', 'sample-pdf-with-images.pdf']

for filename in filenames:
    # Open in binary mode; the file is closed when the block ends.
    with open(filename, 'rb') as pdf_file:
        reader = PdfReader(pdf_file)
        text = ""

        # Number pages from 1 for the printed output.
        for count, page in enumerate(reader.pages, start=1):
            page_text = page.extract_text()
            text += page_text
            # str.title() only title-cases the string; it does not pick out
            # a heading, so this prints the whole text of the page.
            print(count, "= ", page_text.title())

Also, how can I extract highlighted text from a PDF?
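
For the title part of the question, one possible heuristic (a sketch under assumptions, not a definitive fix) is to treat the text drawn in the largest font on a page as its title. The names `pick_title` and `page_title` below are hypothetical. The wrapper assumes a PyPDF2 release whose `extract_text` accepts a `visitor_text` callback that is called with `(text, cm, tm, font_dict, font_size)`. The reported font size may not account for scaling applied by the text matrix, so the result can be wrong for some PDFs.

```python
def pick_title(spans):
    """Join the text fragments drawn in the largest font size.

    spans: iterable of (text, font_size) pairs in reading order.
    Returns "" when there is no non-blank text.
    """
    # Ignore fragments that are only whitespace (e.g. line breaks).
    spans = [(text, size) for text, size in spans if text.strip()]
    if not spans:
        return ""
    largest = max(size for _, size in spans)
    parts = [text.strip() for text, size in spans if size == largest]
    return " ".join(parts)


def page_title(page):
    """Collect (text, font_size) pairs from a PyPDF2 page and pick a title.

    Assumption: page.extract_text supports the visitor_text callback.
    """
    spans = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Only the text and its font size are needed for this heuristic.
        spans.append((text, font_size))

    page.extract_text(visitor_text=visitor)
    return pick_title(spans)
```

Calling `page_title` on each page object in the loop, in place of the `.title()` call, would print one candidate title per page. The heuristic fails when body text and title share a font size, or when a page uses a large decorative font for something other than the title.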

edited Jul 13 '22 at 07:07

molbdnilo

asked Jul 13 '22 at 06:13

Prajkta Mangulkar

Highlights are an annotation. So far we don't support extracting text from a specific region only – Martin Thoma Jul 30 '22 at 10:09
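
To illustrate the comment above: a highlight is stored as an annotation on the page, with a `/Subtype` of `/Highlight` and a `/Rect` bounding box, and is not part of the page's text stream. The sketch below only lists those rectangles. It assumes PyPDF2 page objects that expose an `/Annots` array whose entries can be resolved with `get_object()`; `highlight_rects` and `page_highlight_rects` are hypothetical helper names. Mapping a rectangle back to the words it covers is left out, since region-based text extraction is not supported, as the comment notes.

```python
def highlight_rects(annotations):
    """Return the /Rect of every annotation whose /Subtype is /Highlight.

    annotations: iterable of dict-like annotation objects (already resolved).
    Each rectangle is returned as a list of four floats.
    """
    rects = []
    for annot in annotations:
        if "/Subtype" in annot and annot["/Subtype"] == "/Highlight":
            rects.append([float(value) for value in annot["/Rect"]])
    return rects


def page_highlight_rects(page):
    """List highlight rectangles on a PyPDF2 page.

    Assumption: entries of page["/Annots"] support get_object().
    """
    if "/Annots" not in page:
        return []  # the page has no annotations at all
    resolved = [ref.get_object() for ref in page["/Annots"]]
    return highlight_rects(resolved)
```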

0 Answers