How can I extract text fragments from PDF with their coordinates in Python?

Question

Given a digitally created PDF file, I would like to extract the text with the coordinates. A bounding box would be awesome, but an anchor + font / font size would work as well.

I've created an example PDF document so that it's easy to try things out / share the result.

What I've tried

pdftotext

pdftotext PDF-export-example.pdf -layout

gives this output. It already contains the text, but the coordinates are not in there.

PyPDF2

PyPDF2 is even worse: it gives neither coordinates nor font size, and in this case not even ASCII-art clues about the layout:

from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0


def text_extractor(path):
    with open(path, "rb") as f:
        pdf = PdfReader(f)
        page = pdf.pages[0]  # first page
        text = page.extract_text()
        print(text)


if __name__ == "__main__":
    path = "PDF-export-example.pdf"
    text_extractor(path)

pdfminer.six

Another method to extract text, but without coordinates / font size. Thank you to Duck puncher for this one:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = "utf-8"
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, "rb")
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(
        fp,
        pagenos,
        maxpages=maxpages,
        password=password,
        caching=caching,
        check_extractable=True,
    ):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text


if __name__ == "__main__":
    print(convert_pdf_to_txt("PDF-export-example.pdf"))

This one goes a bit more in the right direction as it can give the font name and size. But the coordinates are still missing (and the output is a bit verbose as it is character-by-character):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

for page_layout in extract_pages("PDF-export-example.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        print(character)
                        print(character.fontname)
                        print(character.size)
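
The character-by-character output can be condensed by merging consecutive characters that share a font name and size. A minimal sketch, assuming each character object exposes `get_text()`, `fontname`, `size` and `bbox` the way pdfminer.six's `LTChar` does. `merge_char_runs` is a hypothetical helper, and `FakeChar` with its made-up font names and coordinates is only a stand-in so the sketch runs without a PDF:

```python
from collections import namedtuple


class FakeChar(namedtuple("FakeChar", "text fontname size bbox")):
    """Stand-in with the LTChar attributes used below (illustration only)."""

    def get_text(self):
        return self.text


def merge_char_runs(chars):
    """Hypothetical helper: merge consecutive chars with the same font.

    Returns a list of (text, fontname, size, bbox) tuples, where bbox is the
    union (x0, y0, x1, y1) of the merged characters' boxes.
    """
    runs = []
    for ch in chars:
        key = (ch.fontname, round(ch.size, 2))
        if runs and runs[-1][0] == key:
            # Same font as the previous run: extend its text and bbox.
            _, text, (x0, y0, x1, y1) = runs[-1]
            cx0, cy0, cx1, cy1 = ch.bbox
            bbox = (min(x0, cx0), min(y0, cy0), max(x1, cx1), max(y1, cy1))
            runs[-1] = (key, text + ch.get_text(), bbox)
        else:
            # Font changed (or first char): start a new run.
            runs.append((key, ch.get_text(), tuple(ch.bbox)))
    return [(text, font, size, bbox) for (font, size), text, bbox in runs]


# Made-up sample values, not taken from the example PDF.
chars = [
    FakeChar("T", "Arial-Bold", 28.0, (72.0, 700.0, 89.0, 728.0)),
    FakeChar("e", "Arial-Bold", 28.0, (89.0, 700.0, 104.0, 728.0)),
    FakeChar("L", "Arial", 11.0, (72.0, 680.0, 78.0, 691.0)),
]
print(merge_char_runs(chars))
```

With a real document, the `LTChar` objects from the loop above could be passed in instead of `FakeChar`. Note that pdfminer.six reports coordinates with the origin at the bottom-left of the page.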

tabula-py

Here I don't get anything at all:

from tabula import read_pdf

df = read_pdf("PDF-export-example.pdf")
print(df)

Vishal Singh · Accepted Answer · 2021-02-17T15:41:56.663

I've used PyMuPDF to extract page content as a list of single words with bbox information.

import fitz

doc = fitz.open("PDF-export-example.pdf")

for page in doc:
    wlist = page.get_text("words")  # make the word list (formerly page.getTextWords())
    print(wlist)

Output:

[
    (72.0250015258789, 72.119873046875, 114.96617889404297, 106.299560546875, 'Test', 0, 0, 0),
    (120.26901245117188, 72.119873046875, 231.72618103027344, 106.299560546875, 'document', 0, 0, 1),
    (72.0250015258789, 106.21942138671875, 101.52294921875, 120.18524169921875, 'Lorem', 1, 0, 0),
    (103.98699951171875, 106.21942138671875, 132.00445556640625, 120.18524169921875, 'ipsum', 1, 0, 1),
    (134.45799255371094, 106.21942138671875, 159.06637573242188, 120.18524169921875, 'dolor', 1, 0, 2),
    (161.40098571777344, 106.21942138671875, 171.95208740234375, 120.18524169921875, 'sit', 1, 0, 3),
    ...
]

The `page.getTextWords()` method (`page.get_text("words")` in newer PyMuPDF releases) separates a page’s text into “words” using spaces and line breaks as delimiters. Therefore the words in this list contain no spaces or line breaks.
Return type: list

An item of this list looks like this:

(x0, y0, x1, y1, "word", block_no, line_no, word_no)

Where the first 4 items are the float coordinates of the word’s bbox. The last three integers provide some more information on the word’s whereabouts.
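
Because each tuple carries `block_no` and `line_no`, the words can be regrouped into lines with one bounding box per line. A minimal sketch, assuming the tuple layout described above. `group_words_into_lines` is a hypothetical helper, not part of PyMuPDF, and the sample list is copied from the output above with coordinates rounded:

```python
from collections import defaultdict

# Sample items in the (x0, y0, x1, y1, "word", block_no, line_no, word_no)
# layout; values rounded from the output shown above.
words = [
    (72.03, 72.12, 114.97, 106.30, "Test", 0, 0, 0),
    (120.27, 72.12, 231.73, 106.30, "document", 0, 0, 1),
    (72.03, 106.22, 101.52, 120.19, "Lorem", 1, 0, 0),
    (103.99, 106.22, 132.00, 120.19, "ipsum", 1, 0, 1),
    (134.46, 106.22, 159.07, 120.19, "dolor", 1, 0, 2),
    (161.40, 106.22, 171.95, 120.19, "sit", 1, 0, 3),
]


def group_words_into_lines(words):
    """Hypothetical helper: merge word tuples into (bbox, text) per line."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, x0, y0, x1, y1, text))

    result = []
    for key in sorted(lines):
        items = sorted(lines[key])  # order by word_no within the line
        bbox = (
            min(i[1] for i in items),
            min(i[2] for i in items),
            max(i[3] for i in items),
            max(i[4] for i in items),
        )
        result.append((bbox, " ".join(i[5] for i in items)))
    return result


for bbox, text in group_words_into_lines(words):
    print(bbox, text)
```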

A Note on the Name fitz
The standard Python import statement for the PyMuPDF library is `import fitz`. This has a historical reason:

The original rendering library for MuPDF was called Libart.

After Artifex Software acquired the MuPDF project, the development focus shifted to writing a new modern graphics library called Fitz. Fitz was originally intended as an R&D project to replace the aging Ghostscript graphics library, but has instead become the rendering engine powering MuPDF.

Wow, this is pretty awesome! Do you know if it is possible to get the Font / Font size as well? — Martin Thoma, Jul 31 '20 at 09:45
this might help https://pymupdf.readthedocs.io/en/latest/faq.html#how-to-analyze-font-characteristics — Vishal Singh, Jul 31 '20 at 09:58
Great answer. Can you also tell what exactly are these coordinate values in? I mean are these pixel values, or some other scale? I see them be floating-point numbers, so I don't think that it is pixel value. I want to display it on the PDF as an image. Using OpenCV for that — Rishabh Gupta, Feb 17 '21 at 12:31
@RishabhGupta take a look at this. https://pymupdf.readthedocs.io/en/latest/rect.html#rect — Vishal Singh, Feb 17 '21 at 15:17
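
On the units question: PDF coordinates are expressed in points, where one point is 1/72 inch, so the pixel position depends on the resolution at which the page is rendered. A minimal sketch of the conversion, assuming the image was rendered at a known DPI and that both the bbox and the image use a top-left origin (as PyMuPDF's word bboxes do). `bbox_to_pixels` is a hypothetical helper:

```python
def bbox_to_pixels(bbox, dpi=150):
    """Hypothetical helper: scale a bbox from PDF points to integer pixels.

    One PDF point is 1/72 inch, so the scale factor is dpi / 72.
    """
    scale = dpi / 72
    x0, y0, x1, y1 = bbox
    return (round(x0 * scale), round(y0 * scale),
            round(x1 * scale), round(y1 * scale))


# Bbox of the word "Test" from the output above, rounded.
print(bbox_to_pixels((72.03, 72.12, 114.97, 106.30), dpi=144))
```

The resulting integer corners could then be handed to a drawing call such as OpenCV's `cv2.rectangle`, provided the page image was rendered at the same DPI.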

Stef · Answer 2 · 2020-07-30T10:37:41.047

You can parse the output of poppler's pdftotext with the -bbox option:

import subprocess
from lxml import etree

pdf_path = 'PDF-export-example.pdf'
ns = '{http://www.w3.org/1999/xhtml}'  # XHTML namespace used by pdftotext -bbox
xml = etree.fromstring(subprocess.check_output(['pdftotext', '-bbox', pdf_path, '-']))
for pn, page in enumerate(xml.findall('.//' + ns + 'page')):
    for word in page.findall(ns + 'word'):
        print(pn, word.get('xMin'), word.get('yMin'),
              word.get('xMax'), word.get('yMax'), word.text)

Output:

0 72.025000 72.124000 114.977000 105.780000 Test
0 120.269000 72.124000 231.737000 105.780000 document
0 72.025000 106.220500 101.519500 119.755000 Lorem
0 103.987000 106.220500 132.001000 119.755000 ipsum
0 134.458000 106.220500 159.070000 119.755000 dolor
...
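
If lxml is not available, the same parsing can be sketched with the standard library, returning the coordinates as floats rather than strings. This is a minimal sketch under assumptions: `parse_bbox_xml` is a hypothetical helper, and `SAMPLE` is a hand-written, simplified stand-in for the `pdftotext -bbox` output (only the word values are taken from the output above):

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hand-written, simplified sample (an assumption for illustration); the word
# attributes mirror the values printed above.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page>
    <word xMin="72.025000" yMin="72.124000" xMax="114.977000" yMax="105.780000">Test</word>
    <word xMin="120.269000" yMin="72.124000" xMax="231.737000" yMax="105.780000">document</word>
  </page>
</doc>
</body>
</html>"""


def parse_bbox_xml(xml_text):
    """Hypothetical helper: one dict per word with page index and float bbox."""
    root = ET.fromstring(xml_text)
    words = []
    for page_no, page in enumerate(root.iter(XHTML + "page")):
        for word in page.iter(XHTML + "word"):
            bbox = tuple(
                float(word.get(key)) for key in ("xMin", "yMin", "xMax", "yMax")
            )
            words.append({"page": page_no, "bbox": bbox, "text": word.text})
    return words


for item in parse_bbox_xml(SAMPLE):
    print(item)
```

With a real file, the bytes returned by `subprocess.check_output(['pdftotext', '-bbox', path, '-'])` could be passed to `parse_bbox_xml` in place of `SAMPLE`.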

edited Jul 30 '20 at 10:37

answered Jul 30 '20 at 10:26

please, how can i get width and height of page using your solution "etree" ? – Data Scientist Oct 14 '21 at 15:03
@DataScientist `etree` is used to parse the output of `pdftotext`, which is for text extraction and doesn't show the page size. See https://stackoverflow.com/questions/6230752/extracting-page-sizes-from-pdf-in-python/48886525 for how to get the page size. – Stef Oct 14 '21 at 15:25