python - read pdf ignoring header and footer

Question

I have a PDF file that I am reading with PyMuPDF using the code below.

import fitz  # PyMuPDF

with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        # get_text() is the current name; older releases spelled it getText()
        text += page.get_text()

Is there a way to ignore the header and footer while reading it?

I tried converting the PDF to DOCX, since headers are easier to remove there, but the conversion changes the formatting of the file I am working with.

Can PyMuPDF do this while reading the file?

dzejms · Answer 1 · 2021-12-02T18:02:47.087

The documentation has a page dedicated to this problem.
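As a rough illustration of one way to do this in PyMuPDF: `Page.get_text()` accepts a `clip` rectangle, so extraction can be limited to the area between the header and footer. The sketch below assumes the header and footer sit in fixed-height bands at the top and bottom of every page. The helper names `body_box` and `extract_body_text` and the 50-point defaults are made up for this example, so measure your own file and adjust them.

```python
def body_box(x0, y0, x1, y1, header=50.0, footer=50.0):
    """Shrink a page rectangle by a header band (top) and a footer band (bottom).

    Coordinates are in points with the origin at the top-left and y growing
    downward, which is the convention PyMuPDF uses for page.rect.
    The band heights are assumptions, not values read from the PDF.
    """
    if header < 0 or footer < 0:
        raise ValueError("band heights must be non-negative")
    if header + footer >= y1 - y0:
        raise ValueError("header and footer bands cover the whole page")
    return (x0, y0 + header, x1, y1 - footer)


def extract_body_text(path, header=50.0, footer=50.0):
    """Concatenate the text of every page, skipping the top and bottom bands."""
    # Imported here so body_box can be used (and tested) without PyMuPDF.
    import fitz  # PyMuPDF

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            r = page.rect
            clip = fitz.Rect(*body_box(r.x0, r.y0, r.x1, r.y1, header, footer))
            # Only text inside the clip rectangle is extracted.
            parts.append(page.get_text("text", clip=clip))
    return "".join(parts)


if __name__ == "__main__":
    # 'file.pdf' is the file name from the question.
    print(extract_body_text("file.pdf", header=60, footer=40))
```

Fixed bands are the simplest choice, but they will cut off body text if the header height varies from page to page. In that case you would need to pick the band per page, for example by looking at the block coordinates returned by `page.get_text("blocks")`.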

The generic solution that works for most pdf libraries is to
