I have a PDF file that I am reading with PyMuPDF, using the code below.

import fitz  # this is PyMuPDF

with fitz.open('file.pdf') as doc:
    text = ""
    for page in doc:
        # get_text() is the current name; getText() is the deprecated alias
        text += page.get_text()

Is there a way to ignore the header and footer while reading it?

I tried converting the PDF to docx, since headers are easier to remove there, but the file I am working on gets reformatted during the conversion.

Can PyMuPDF do this while reading the file?

Jeff Schaller

1 Answer

The PyMuPDF-Utilities repository has a page dedicated to this problem.

  1. Define a rectangle that omits the header and footer.
  2. Call page.get_textbox(rect) with that rectangle.

Source: https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction#2-pageget_textboxrect

The generic solution, which works with most PDF libraries, is to:

  1. check the size of the header/footer section in your PDF files
  2. loop over each piece of text in the document and check its vertical position, skipping anything that falls inside the header or footer
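
As a rough, library-independent sketch of that idea: the function below assumes each piece of text comes as a tuple starting with (x0, y0, x1, y1, text) and that y is measured downward from the top of the page. The name drop_header_footer and the sample numbers are just for illustration. Some libraries put the origin at the bottom-left instead, in which case the comparison has to be flipped.

```python
def drop_header_footer(blocks, page_height, header_height, footer_height):
    """Keep only the text whose box lies fully between the header and the footer.

    blocks: iterable of tuples whose first five items are (x0, y0, x1, y1, text),
            with y measured downward from the top of the page (an assumption).
    """
    top = header_height                    # everything above this is header
    bottom = page_height - footer_height   # everything below this is footer
    kept = []
    for block in blocks:
        _x0, y0, _x1, y1, text = block[:5]  # ignore any extra fields
        if y0 >= top and y1 <= bottom:
            kept.append(text)
    return kept


# Made-up boxes on an A4-sized page (842 points high)
sample = [
    (72, 20, 300, 35, "Company header"),
    (72, 100, 300, 700, "Actual content"),
    (72, 810, 300, 825, "Page 1 of 10"),
]
print(drop_header_footer(sample, page_height=842, header_height=50, footer_height=50))
# → ['Actual content']
```

With PyMuPDF, page.get_text("blocks") returns tuples that start with these five fields, so they can be passed in directly together with page.rect.height.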
dzejms