How to extract text geometry using PyPDF2?

Question

I have PDF documents, and it's clear to me how to extract text from them.

I need to extract not only text but also coordinates associated with this text.

Here is my code:

from PyPDF2 import PdfReader

pdf_path = 'docs/doc_3.pdf'
pdf = PdfReader(pdf_path)
# Pages are zero-indexed, so index 1 is the second page of the document
page_1_object = pdf.pages[1]
page_1_object.extract_text().split("\n")

The result is:

['Creating value for all stakeholders',
 'Anglo\xa0American is re-imagining mining to improve people’s lives.']

I need the geometry associated with each extracted paragraph, for example something like this:

[['Creating value for all stakeholders', [1, 2, 3, 4]],
 ['Anglo\xa0American is re-imagining mining to improve people’s lives.', [7, 8, 9, 10]]]

How can I accomplish this?

Thanks,

score 0 · Accepted Answer · answered Aug 28 '22 at 00:43

Currently PyPDF2 does not offer that feature. It can parse page content into text, as your code shows, but it does not keep the individual glyph x/y positions, nor does it output line coordinates.

There are other means in Python to extract one or more groups of letters that form words.

Using shell commands such as Poppler, in conjunction with a text "word" from PyPDF2, is possible. However, the norm would be to use another Python PDF library such as PyMuPDF. This article, https://pyquestions.com/find-text-position-in-pdf-file, covers highlighting with PyMuPDF input.
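As a rough illustration of the PyMuPDF route (not a PyPDF2 feature), here is a minimal sketch. It assumes PyMuPDF is installed (pip install pymupdf) and that block-level granularity is close enough to your "paragraphs". The helper names blocks_to_paragraphs and extract_paragraph_boxes are hypothetical, made up for this example. Coordinates come back in PDF points with the origin at the top-left of the page.

```python
def blocks_to_paragraphs(blocks):
    """Turn PyMuPDF block tuples into [text, [x0, y0, x1, y1]] pairs.

    Each block from page.get_text("blocks") is a tuple:
    (x0, y0, x1, y1, text, block_no, block_type).
    """
    result = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        # Collapse line breaks and other whitespace (including \xa0) to single spaces
        text = " ".join(text.split())
        if text:
            result.append([text, [x0, y0, x1, y1]])
    return result


def extract_paragraph_boxes(pdf_path, page_index=0):
    """Return [text, bbox] pairs for one page (page_index is zero-based)."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        return blocks_to_paragraphs(page.get_text("blocks"))
```

Calling extract_paragraph_boxes('docs/doc_3.pdf', page_index=1) should then give a list in the shape you asked for, one entry per text block. How well blocks line up with visual paragraphs depends on how the PDF was produced, so treat the grouping as a starting point.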

The most common way to reach your goal is probably the one described in "How to extract text and text coordinates from a PDF file?"
