Issues with PyMuPDF extracting plain text

Question

I want to read in a PDF file using PyMuPDF. All I need is plain text (no need to extract info on color, fonts, tables etc.).

I have tried the following

import fitz
from fitz import TextPage
ifile = "C:\\user\\docs\\aPDFfile.pdf"
doc = TextPage(ifile)
>>> TypeError: in method 'new_TextPage', argument 1 of type 'struct fz_rect_s *'

Which doesn't work, so then I tried

doc = fitz.Document(ifile)
t = TextPage.extractText(doc)
>>> AttributeError: 'Document' object has no attribute '_extractText'

which again doesn't work.

Then I found a great blog post by one of the authors of PyMuPDF with detailed code for extracting text in the order it is read from the file. But every time I run this code with a different PDF, I get KeyError: 'lines' (line 81 in the code) or KeyError: "bbox" (line 60 in the code).

I can't post the PDFs here because they are confidential, though I appreciate they would be useful to have. But is there any way I can just do the simplest task which PyMuPDF is meant to do: extract plain text from a PDF, un-ordered or otherwise (I don't mind much)?

Do a check that the key you are looking for (e.g., `lines` or `bbox`) is in the dictionary (e.g., a block) before accessing that key. — J. Owens, Jun 04 '18 at 15:38
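The check suggested in this comment can be sketched as follows. It assumes the nested dictionary layout that `page.get_text("dict")` produces, in which text blocks carry a `lines` list while image blocks do not (one plausible cause of the `KeyError: 'lines'`, though the blog's code may differ). The helper name `iter_line_texts` and the sample data are hypothetical.

```python
def iter_line_texts(page_dict):
    """Yield the text of each line, skipping blocks without the expected keys."""
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so use .get() instead of indexing
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            yield "".join(span.get("text", "") for span in spans)


# Hypothetical sample shaped like the output of page.get_text("dict")
sample = {
    "blocks": [
        {"type": 0, "bbox": (0, 0, 100, 20),
         "lines": [{"spans": [{"text": "Hello "}, {"text": "world"}]}]},
        {"type": 1, "bbox": (0, 30, 100, 80)},  # image block: no "lines"
    ]
}
print(list(iter_line_texts(sample)))
```

With a real document, the same function would be fed `page.get_text("dict")` for each page.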

Jorj McKie · Answer 1 · 2022-07-14T12:35:04.640

Message from the repo maintainer:

The easiest way to extract plain text but still do at least basic ordering is

blocks = page.get_text("blocks")
blocks.sort(key=lambda block: block[1])  # sort vertically ascending

for b in blocks:
    print(b[4])  # the text part of each block

In newer versions (1.19.x and later), this is even simpler: just call text = page.get_text(sort=True). It returns the full page's text as a single string in basic reading order, top-left to bottom-right.
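For versions that predate the `sort` argument, the block-sorting approach above can be wrapped in a small helper. This is only a sketch, assuming each block is the 7-item tuple `(x0, y0, x1, y1, text, block_no, block_type)` that `page.get_text("blocks")` returns, with `block_type` 0 marking text. The name `blocks_to_text` and the sample tuples are hypothetical.

```python
def blocks_to_text(blocks):
    """Join text blocks in top-to-bottom, then left-to-right order."""
    text_blocks = [b for b in blocks if b[6] == 0]  # drop image blocks
    text_blocks.sort(key=lambda b: (b[1], b[0]))    # sort by y0, then x0
    return "\n".join(b[4].strip() for b in text_blocks)


# Hypothetical blocks shaped like the output of page.get_text("blocks")
sample_blocks = [
    (50, 200, 300, 220, "Second paragraph\n", 1, 0),
    (50, 100, 300, 120, "First paragraph\n", 0, 0),
    (50, 150, 300, 180, "<image: ...>", 2, 1),
]
print(blocks_to_text(sample_blocks))
```

With a real document this would be called as `blocks_to_text(page.get_text("blocks"))` for each page.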

score 8 · Answer 2 · edited Aug 19 '20 at 12:32

The process of extracting text following your example using PyMuPDF is:

import fitz  # PyMuPDF

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ""
with fitz.open(filepath) as doc:
    for page in doc:
        # get_text() was named getText() in older PyMuPDF releases
        text += page.get_text()
print(text)

The blog you followed is great, but it is a little outdated: some of its methods are deprecated.
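Since the method name differs between releases (`getText` in older ones, `get_text` in newer ones), a version-tolerant lookup is one option. This is a minimal sketch; `page_text` and the two stand-in classes are hypothetical and only mimic the two method names, so the snippet runs without a PDF.

```python
def page_text(page):
    """Return a page's plain text using whichever method name exists."""
    extract = getattr(page, "get_text", None) or getattr(page, "getText", None)
    if extract is None:
        raise AttributeError("page has neither get_text() nor getText()")
    return extract()


class NewStylePage:
    """Stand-in for a page object from a newer release."""
    def get_text(self):
        return "new"


class OldStylePage:
    """Stand-in for a page object from an older release."""
    def getText(self):
        return "old"


print(page_text(NewStylePage()), page_text(OldStylePage()))
```

In the loop above, `text += page_text(page)` would then replace the direct method call.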

edited Aug 19 '20 at 12:32

Martin Thoma


answered Jan 14 '19 at 10:17

Vasko


score -1 · Answer 3 · edited Dec 11 '21 at 09:04

import fitz

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ''
with fitz.open(filepath) as doc:
    for page in doc:
        text += page.get_text()

print(text)

edited Dec 11 '21 at 09:04

Antoine


answered Dec 11 '21 at 03:13

thunderhit

Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, **can you [edit] your answer to include an explanation of what you're doing** and why you believe it is the best approach? – Jeremy Caney Dec 12 '21 at 01:03

score -1 · Answer 4 · edited Oct 14 '22 at 19:19

Use the snake_case method name get_text() (lowercase t, with an underscore):

import fitz  # PyMuPDF

filepath = "C:\\user\\docs\\aPDFfile.pdf"

text = ""
with fitz.open(filepath) as doc:
    for page in doc:
        text += page.get_text()
print(text)

This should work for you.

edited Oct 14 '22 at 19:19

Michael S.


answered Oct 11 '22 at 09:11

krrish rajpurohit

