26

I have a PDF file, and I am trying to find a specific piece of text in it and highlight it using Python. I found pypdf, which can highlight a region of a PDF page when given the coordinates of that region.

I am trying to find a tool which can give me the position of a given text in the PDF.

Martin Thoma
Simdan

3 Answers

37

PyMuPDF can search for text and return the coordinates of each match. You can use this in conjunction with the PyPDF2 highlighting method to accomplish what you're describing, or you can just use PyMuPDF to do the highlighting as well.

Here is sample code for finding text and highlighting with PyMuPDF:

import fitz

### READ IN PDF
doc = fitz.open("input.pdf")

for page in doc:
    ### SEARCH
    text = "Sample text"
    text_instances = page.search_for(text)

    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.add_highlight_annot(inst)
        highlight.update()


### OUTPUT
doc.save("output.pdf", garbage=4, deflate=True, clean=True)
Cilantro Ditrek
  • On Windows I cannot install fitz. – keramat Jul 06 '19 at 07:17
  • 2
    @keramat There is an issue with the 32/64-bit versions. You need to install an older version of PyMuPDF. `pip install PyMuPDF==1.16.7` worked, but the default and latest versions didn't. See here for more information: https://github.com/pymupdf/PyMuPDF/issues/414 – Sangit Gurung Feb 14 '20 at 00:54
  • Thanks for this @cilantro Ditrek, Could you please let me know if there is a way to draw a red line box around that text instead of just highlighting – SMR Feb 04 '21 at 12:42
  • @SMR check out this answer: https://stackoverflow.com/a/60559033/1301888 and there's more info here: https://pymupdf.readthedocs.io/en/latest/annot.html#Annot.set_colors – Cilantro Ditrek Feb 04 '21 at 15:14
  • 3
    The above code needs a little help. Add `highlight.update()` right after `highlight = ...` Also, if the pdf document has more than one page, then wrap the `### SEARCH` and `### HIGHLIGHT` sections in a `for page in doc:` loop and get rid of `page = doc[0]`. – user1045680 May 01 '21 at 14:28
  • 1
    On Ubuntu 18.04 "Bionic" it works with `pip3 install PyMuPDF==1.16`, even though I have installed libmupdf-dev version 1.12 – am70 Jun 06 '21 at 10:06
5

Some methods were deprecated in newer versions of PyMuPDF. Here is sample code for a recent version. I've also added a comment to each highlight, which makes it easier for the reader to step through them.

import fitz

pdfIn = fitz.open("page-4.pdf")

for page in pdfIn:
    print(page)
    texts = ["SEPA", "voorstelnummer"]
    # search_for returns a list of rectangles per term; flatten into one list
    text_instances = [rect for text in texts for rect in page.search_for(text)]
    
    # coordinates of each word found in PDF-page
    print(text_instances)  

    # iterate through each instance for highlighting
    for inst in text_instances:
        annot = page.add_highlight_annot(inst)
        # annot = page.add_rect_annot(inst)
        
        ## Adding comment to the highlighted text
        info = annot.info
        info["title"] = "word_diffs"
        info["content"] = "diffs"
        annot.set_info(info)
        annot.update()


# Saving the PDF Output
pdfIn.save("page-4_output.pdf")

RevolverRakk
0

If you are on Windows and have Acrobat Pro (not Reader), you can try driving Acrobat through its older COM (Component Object Model) interface from Python or VBA.


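import os
import winerror
import win32com.client
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

# Placeholders (not in the original answer): set these to your own file and search term
src = os.path.abspath("input.pdf")
keyword = "Sample text"

ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
win32com.client.gencache.EnsureModule('{E64169B3-3592-47d2-816E-602C5C13F328}', 0, 1, 1)
avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
avDoc.Open(src, src)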

avDoc.BringToFront()
pdDoc = avDoc.GetPDDoc()
jsoObject = pdDoc.GetJSObject()

for pageNo in range(1):
    pdfPage = pdDoc.AcquirePage(pageNo)
    pageHL = win32com.client.DispatchEx('AcroExch.HiliteList')
    _ = pageHL.Add(0, 9000)
    pageSel = pdfPage.CreatePageHilite(pageHL)

    pdfText = ""
    for wordNo in range(pageSel.GetNumText()):
        word = pageSel.GetText(wordNo)
        pdfText += word

        if keyword in pdfText:
            wordToHl = win32com.client.DispatchEx('AcroExch.HiliteList')
            wordToHl.Add(wordNo, 1)
            wordHl = pdfPage.CreateWordHilite(wordToHl)
            rect = wordHl.GetBoundingRect()
            annot = jsoObject.AddAnnot()
            props = annot.GetProps()
            props.Type = "Square"
            props.Page = pageNo
            props.Hidden = False
            props.Lock = True
            props.Name = word
            props.NoView = False
            props.Opacity = 0.3
            props.ReadOnly = True
            props.Style = "S"
            props.ToggleNoView = False
            props.PopupOpen = False
            popupRect = [rect.Left - 5, rect.Top + 5, rect.Left + 40, rect.Top - 20]
            props.Rect = popupRect
            props.PopupRect = popupRect
            props.StrokeColor = jsoObject.Color.Red
            props.FillColor = jsoObject.Color.Yellow

            annot.SetProps(props)
            print(f'Found {keyword}')
            # Reset the buffer so later words are not treated as matches again
            pdfText = ""
Yiping