Find text position in PDF file

Question

I have a PDF file and I am trying to find a specific text in the PDF and highlight it using Python. I found pypdf, which can highlight part of a PDF when we give the coordinates of the wanted highlight position in the file.

I am trying to find a tool which can give me the position of a given text in the PDF.

Have you tried searching for Python libraries which are able to parse PDF files? — Ciaran Gallagher, Nov 26 '17 at 14:51
Hopefully this helps: https://stackoverflow.com/questions/8971243/free-tool-for-watching-coordinates-in-pdf — Adarsh, Nov 26 '17 at 15:12
Also searching for this functionality with no luck so far (would like to have it work via command line)... — Photon, Jun 11 '18 at 19:11

Cilantro Ditrek · Accepted Answer · 2023-07-11T20:04:23.083

37

PyMuPDF can search a page for text and return the coordinates of each match. You can use those coordinates with the PyPDF2 highlighting method to accomplish what you're describing, or you can simply highlight the text with PyMuPDF itself.
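
If you do pass PyMuPDF coordinates to another library, keep in mind that PyMuPDF measures y downward from the top-left corner of the page, while PDF user space measures y upward from the bottom-left corner. The helper below is a minimal sketch of that flip. The name `mupdf_rect_to_pdf` is made up for illustration, and it assumes an unrotated page whose origin is at (0, 0).

```python
def mupdf_rect_to_pdf(rect, page_height):
    """Flip an (x0, y0, x1, y1) rectangle from a top-left origin
    to a bottom-left origin.

    Assumes an unrotated page whose lower-left corner is at (0, 0).
    Returns (x_left, y_bottom, x_right, y_top).
    """
    x0, y0, x1, y1 = rect
    y_bottom = page_height - y1  # the larger top-left y is the lower edge
    y_top = page_height - y0
    return (x0, y_bottom, x1, y_top)


# Example: a match near the top of a US Letter page (792 pt tall)
print(mupdf_rect_to_pdf((72, 100, 200, 120), 792))
```

With PyMuPDF, the page height would typically come from `page.rect.height`.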

Here is sample code for finding text and highlighting with PyMuPDF:

import fitz

### READ IN PDF
doc = fitz.open("input.pdf")

for page in doc:
    ### SEARCH
    text = "Sample text"
    text_instances = page.search_for(text)

    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.add_highlight_annot(inst)
        highlight.update()


### OUTPUT
doc.save("output.pdf", garbage=4, deflate=True, clean=True)
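
If you only need the positions and not the highlights, the same `search_for` call can be wrapped in a small helper. This is a sketch rather than a definitive implementation: `find_text_positions` is a hypothetical name, and it only assumes that `doc` yields pages with a `search_for` method that returns rectangles with `x0`, `y0`, `x1`, `y1` attributes, as PyMuPDF's do.

```python
def find_text_positions(doc, needle):
    """Return a list of (page_index, (x0, y0, x1, y1)) for every match."""
    positions = []
    for page_index, page in enumerate(doc):
        for rect in page.search_for(needle):
            positions.append(
                (page_index, (rect.x0, rect.y0, rect.x1, rect.y1))
            )
    return positions


# Usage with PyMuPDF (file name is a placeholder):
#   import fitz
#   doc = fitz.open("input.pdf")
#   for page_index, box in find_text_positions(doc, "Sample text"):
#       print(page_index, box)
```

The coordinates are in points, with the origin at the top-left corner of the page.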

edited Jul 11 '23 at 20:04

answered Oct 24 '18 at 20:39


On Windows I cannot install fitz. – keramat Jul 06 '19 at 07:17
@keramat There is an issue with the 32/64-bit versions, so you need to install an older version of PyMuPDF. `pip install PyMuPDF==1.16.7` worked, but the default (latest) version didn't. See https://github.com/pymupdf/PyMuPDF/issues/414 for more information. – Sangit Gurung Feb 14 '20 at 00:54
Thanks for this, @Cilantro Ditrek. Could you please let me know if there is a way to draw a red box around the text instead of just highlighting it? – SMR Feb 04 '21 at 12:42
@SMR Check out this answer: https://stackoverflow.com/a/60559033/1301888. There's more info here: https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.set_colors – Cilantro Ditrek Feb 04 '21 at 15:14
The above code needs a little help. Add `highlight.update()` right after `highlight = ...` Also, if the pdf document has more than one page, then wrap the `### SEARCH` and `### HIGHLIGHT` sections in a `for page in doc:` loop and get rid of `page = doc[0]`. – user1045680 May 01 '21 at 14:28
On Ubuntu 18.04 "Bionic" it works with `pip3 install PyMuPDF==1.16`, even though I have libmupdf-dev version 1.12 installed. – am70 Jun 06 '21 at 10:06

score 5 · Answer 2 · answered Feb 22 '22 at 03:24

With the new version of PyMuPDF, some methods were deprecated. Here is sample code for the recent version. I've also added a comment to each highlight, which makes it easier for the user to step through them.

import fitz  # PyMuPDF

pdfIn = fitz.open("page-4.pdf")
texts = ["SEPA", "voorstelnummer"]

for page in pdfIn:
    print(page)
    # search_for returns a list of rectangles per term; flatten them into one list
    text_instances = [rect for text in texts for rect in page.search_for(text)]

    # coordinates of each match found on this page
    print(text_instances)

    # iterate through each instance for highlighting
    for inst in text_instances:
        annot = page.add_highlight_annot(inst)
        # annot = page.add_rect_annot(inst)

        # add a comment to the highlighted text
        info = annot.info
        info["title"] = "word_diffs"
        info["content"] = "diffs"
        annot.set_info(info)
        annot.update()

# Saving the PDF Output
pdfIn.save("page-4_output.pdf")
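
The commented-out `add_rect_annot` line hints at the alternative asked about in the comments on the accepted answer: drawing a box around each match instead of highlighting it. A rough sketch of that variant follows. The helper name `box_matches` and its `stroke` parameter are made up for illustration, and `page` is assumed to be a PyMuPDF page.

```python
def box_matches(page, needle, stroke=(1, 0, 0)):
    """Outline every match of `needle` on `page`.

    `stroke` is an RGB tuple with components in the range 0..1
    (red by default). Returns the number of boxes drawn.
    """
    count = 0
    for rect in page.search_for(needle):
        annot = page.add_rect_annot(rect)
        annot.set_colors(stroke=stroke)  # border color of the rectangle
        annot.update()                   # apply the appearance change
        count += 1
    return count


# Usage with PyMuPDF (file names are placeholders):
#   import fitz
#   doc = fitz.open("page-4.pdf")
#   for page in doc:
#       box_matches(page, "SEPA")
#   doc.save("page-4_boxed.pdf")
```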

Yiping · Answer 3 · 2021-05-21T03:58:12.867

If you are on Windows and have Acrobat Pro (not Reader), you can try the old Component Object Model (COM) interface from Python or VBA.

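import os
import win32com.client
import winerror
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

# Placeholder inputs (not in the original answer): the PDF to open and the text to find
src = os.path.abspath("input.pdf")
keyword = "Sample text"

ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
win32com.client.gencache.EnsureModule('{E64169B3-3592-47d2-816E-602C5C13F328}', 0, 1, 1)
avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
avDoc.Open(src, src)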

avDoc.BringToFront()
pdDoc = avDoc.GetPDDoc()
jsoObject = pdDoc.GetJSObject()

for pageNo in range(1):
    pdfPage = pdDoc.AcquirePage(pageNo)
    pageHL = win32com.client.DispatchEx('AcroExch.HiliteList')
    _ = pageHL.Add(0, 9000)
    pageSel = pdfPage.CreatePageHilite(pageHL)

    pdfText = ""
    for wordNo in range(pageSel.GetNumText()):
        word = pageSel.GetText(wordNo)
        pdfText += word

        if keyword in pdfText:
            wordToHl = win32com.client.DispatchEx('AcroExch.HiliteList')
            wordToHl.Add(wordNo, 1)
            wordHl = pdfPage.CreateWordHilite(wordToHl)
            rect = wordHl.GetBoundingRect()
            annot = jsoObject.AddAnnot()
            props = annot.GetProps()
            props.Type = "Square"
            props.Page = pageNo
            props.Hidden = False
            props.Lock = True
            props.Name = word
            props.NoView = False
            props.Opacity = 0.3
            props.ReadOnly = True
            props.Style = "S"
            props.ToggleNoView = False
            props.PopupOpen = False
            popupRect = [rect.Left - 5, rect.Top + 5, rect.Left + 40, rect.Top - 20]
            props.Rect = popupRect
            props.PopupRect = popupRect
            props.StrokeColor = jsoObject.Color.Red
            props.FillColor = jsoObject.Color.Yellow

            annot.SetProps(props)
            print(f'Found {keyword}')
            pdfText = ""  # reset so words after this match are not annotated as well
