5

I want to read in a PDF file using PyMuPDF. All I need is plain text (no need to extract info on color, fonts, tables etc.).

I have tried the following

import fitz
from fitz import TextPage
ifile = "C:\\user\\docs\\aPDFfile.pdf"
doc = TextPage(ifile)
>>> TypeError: in method 'new_TextPage', argument 1 of type 'struct fz_rect_s *'

Which doesn't work, so then I tried

doc = fitz.Document(ifile)
t = TextPage.extractText(doc)
>>> AttributeError: 'Document' object has no attribute '_extractText'

which again doesn't work.

Then I found a great blog post by one of the authors of PyMuPDF, which has detailed code for extracting text in the order it is read from the file. But every time I run this code with a different PDF I get KeyError: 'lines' (line 81 in the code) or KeyError: "bbox" (line 60 in the code).

I can't post the PDFs here because they are confidential, though I appreciate they would be useful to have. But is there any way I can do the simplest task PyMuPDF is meant for: extract plain text from a PDF, unordered or otherwise (I don't mind much)?

Martin Thoma
PyRsquared
  • Do a check that the key you are looking for (e.g., `lines` or `bbox`) is in the dictionary (e.g., a block) before accessing that key. – J. Owens Jun 04 '18 at 15:38
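The check suggested in the comment above can be sketched as follows. This is only an illustration, assuming the data has the nested shape that page.get_text("dict") returns (a "blocks" list whose text blocks carry "lines" and "spans", while image blocks have no "lines" key). The helper name text_from_page_dict and the sample data are hypothetical.

```python
def text_from_page_dict(page_dict):
    """Collect text from a dict shaped like page.get_text("dict") output.

    Blocks without a "lines" key (for example image blocks) are skipped
    instead of raising KeyError.
    """
    pieces = []
    for block in page_dict.get("blocks", []):
        # .get() returns an empty list when "lines" is missing,
        # so the inner loop simply does nothing for such blocks.
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            pieces.append("".join(span.get("text", "") for span in spans))
    return "\n".join(pieces)


# Hypothetical sample imitating one text block and one image block.
sample = {
    "blocks": [
        {"type": 0, "bbox": (0, 0, 100, 20),
         "lines": [{"spans": [{"text": "Hello "}, {"text": "world"}]}]},
        {"type": 1, "bbox": (0, 30, 100, 80)},  # image block: no "lines"
    ]
}
print(text_from_page_dict(sample))
```

With a real document, the same helper could be called as text_from_page_dict(page.get_text("dict")) for each page.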

4 Answers

9

Message from the repo maintainer:

The easiest way to extract plain text but still do at least basic ordering is

blocks = page.get_text("blocks")
blocks.sort(key=lambda block: block[1])  # sort vertically ascending

for b in blocks:
    print(b[4])  # the text part of each block

In newer versions (1.19.x and later), this is even simpler: just call text = page.get_text(sort=True). It returns the full page's text as a string in basic reading order, top-left to bottom-right.
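The block-sorting approach above can be extended slightly, for instance to break ties between blocks at the same height and to skip image blocks. The sketch below assumes each block is a tuple laid out as (x0, y0, x1, y1, text, block_no, block_type), with block_type 0 meaning text. The helper name blocks_to_text and the sample tuples are hypothetical and stand in for what page.get_text("blocks") would return.

```python
def blocks_to_text(blocks):
    """Join text blocks in top-to-bottom, then left-to-right order.

    Assumed tuple layout per block:
    (x0, y0, x1, y1, text, block_no, block_type), block_type 0 = text.
    """
    text_blocks = [b for b in blocks if b[6] == 0]  # keep text blocks only
    text_blocks.sort(key=lambda b: (b[1], b[0]))    # sort by y0, then x0
    return "".join(b[4] for b in text_blocks)


# Hypothetical blocks, deliberately out of order, with one image block.
sample_blocks = [
    (50, 200, 300, 220, "second\n", 1, 0),
    (50, 100, 300, 120, "first\n", 0, 0),
    (50, 150, 300, 180, "<image>", 2, 1),
]
print(blocks_to_text(sample_blocks))
```

On a real page this would be called as blocks_to_text(page.get_text("blocks")). Sorting on the (y0, x0) pair is a design choice that only approximates reading order; multi-column layouts would need more care.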

Jorj McKie
8

Following your example, plain text can be extracted with PyMuPDF like this:

import fitz

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        # get_text() is the current name; older releases spelled it getText()
        text += page.get_text()
print(text)

The blog you followed is great but a little outdated; some of the methods it uses are deprecated.

Martin Thoma
Vasko
-1
import fitz

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        text += page.get_text()

print(text)
Antoine
  • Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, **can you [edit] your answer to include an explanation of what you're doing** and why you believe it is the best approach? – Jeremy Caney Dec 12 '21 at 01:03
-1

Use a lowercase t and an underscore, i.e. get_text() instead of getText():

import fitz

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        text += page.get_text()
print(text)

This should work for you.

Michael S.