
I am new to Python and have been working on a project that produces a new PDF with highlighted text. I am using PyMuPDF to extract the text, and I store the text, its font size, and its index.

I found a way to highlight the text, but it searches for and highlights all occurrences of the string (`text`).

    import fitz

    ### READ IN PDF
    doc = fitz.open("input.pdf")
    page = doc[0]

    ### SEARCH
    text = "Sample text"
    text_instances = page.searchFor(text)

    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.addHighlightAnnot(inst)

    ### OUTPUT
    doc.save("output.pdf", garbage=4, deflate=True, clean=True)

I need a way to highlight one specific line or word (not all occurrences), or alternatively a way to store the rect coordinates of each line.

One example of the usage: if there is a heading called "Summary" and the text under that heading contains occurrences of "summary", I want to highlight only the heading (or only the occurrence in the paragraph).

yoyo yoyo
    If I could store the coordinates of the text while extracting it, that would also solve the problem, since I could later call `page.addHighlightAnnot(coordinates)` to get the highlight. But I don't know how to get these coordinates. – yoyo yoyo Aug 25 '20 at 15:36

1 Answer


You can highlight text using PyPDF2.

In order to find the text's location, check out this.
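Since the question already uses PyMuPDF, the coordinates can probably be captured during extraction instead. The following is a minimal sketch, not a definitive implementation. It assumes a recent PyMuPDF release where `page.get_text("dict")` returns blocks, lines, and spans that each carry a `bbox` (older releases spelled these calls `getText` and `addHighlightAnnot`). The helper names `collect_spans`, `find_largest_match`, and `highlight_span` are hypothetical, and picking the heading by largest font size is an assumption about the document, not a rule.

```python
def collect_spans(page_dict):
    """Flatten the output of page.get_text("dict") into a list of spans.

    Each entry keeps the text, font size, and bounding box, so the list
    index can serve as the "index of the text" later on.
    """
    spans = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                spans.append({
                    "text": span["text"],
                    "size": span["size"],
                    "bbox": tuple(span["bbox"]),
                })
    return spans


def find_largest_match(spans, text):
    """Return the index of the span equal to `text` (case-insensitive)
    with the largest font size, or None if nothing matches.

    Assumption: the heading is set in a larger font than the body text.
    """
    wanted = text.strip().lower()
    matches = [i for i, s in enumerate(spans)
               if s["text"].strip().lower() == wanted]
    if not matches:
        return None
    return max(matches, key=lambda i: spans[i]["size"])


def highlight_span(in_path, out_path, page_number, span_index):
    """Highlight a single span, chosen by its index, and save a copy."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(in_path)
    page = doc[page_number]
    spans = collect_spans(page.get_text("dict"))
    # Only the chosen span's rectangle is annotated, not every occurrence.
    page.add_highlight_annot(fitz.Rect(spans[span_index]["bbox"]))
    doc.save(out_path, garbage=4, deflate=True, clean=True)
    doc.close()
```

With this, something like `highlight_span("input.pdf", "output.pdf", 0, idx)` would highlight one span only, where `idx` comes from `find_largest_match` or from whatever index was stored during extraction. A span is a run of text with uniform formatting, so a heading split across several spans would need each of its indices highlighted.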

Revisto
    Yes, but the thing is that it gives the locations of all the occurrences. For example, `text_instances` will hold the location of every "summary", but I want only the one in the heading. What I want is a way to highlight the text by its index number, not by the text itself. – yoyo yoyo Aug 25 '20 at 12:59