Highlight text content in pdf files using python and save a screenshot

Question

I have a list of pdf files and I need to highlight specific text on each page of these files and save a snapshot for each of the text instances.

So far I am able to highlight the text and save the entire page of a pdf file as a snapshot. But, I want to find the position of highlighted text and take a zoomed in the snapshot which will be more detailed compared to the full page snapshot.

I'm pretty sure there must be a solution to this problem. I am new to Python and hence I am not able to find it. I would be really grateful if someone can help me out with this.

I have tried using PyPDF2, Pymupdf libraries but I couldn't figure out the solution. I also tried highlighting by providing coordinates which works but couldn't find a way to get these coordinates as output.

[![Sample snapshot from the code[![\]\[1\]][1]][1]][1]

#import PyPDF2
import os
import fitz
from wand.image import Image
import csv
#import re
#from pdf2image import convert_from_path

check = r'C:\Users\Pradyumna.M\Desktop\Pradyumna\Automation\Intel Bytes\Create Source Docs\Sample Check 8 Apr 2019'

dir1 = check + '\\Source Docs\\'
dir2 = check + '\\Output\\'

dir = [dir1, dir2]

for x in dir:
    try:
        os.mkdir(x)
    except FileExistsError:
        print("Directory ", x, " already exists")

### READ PDF FILE
with open('upload1.csv', newline='') as myfile:
    reader = csv.reader(myfile)
    for row in reader:
        rowarray = '; '.join(row)
        src = rowarray.split("; ")
        file = check + '\\' + src[4] + '.pdf'
        print(file)
        #pdfFileObj = open(file,'rb')
        #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        #print("Total number of pages: " + str(pdfReader.numPages))
        doc = fitz.open(file)
        print(src[5])
        for i in range(int(src[5])-1, int(src[5])):
            i = int(i)
            page = doc[i]
            print("Processing page: " + str(i))
            text = src[3]
            #SEARCH TEXT
            print("Searching: " + text)
            text_instances = page.searchFor(text)
            for inst in text_instances:
                highlight = page.addHighlightAnnot(inst)
                file1 = check + '\\Output\\' + src[4] + '_output.pdf'
                print(file1)
                doc.save(file1, garbage=4, deflate=True, clean=True)
                ### Screenshot
                with(Image(filename=file1, resolution=150)) as source:
                    images = source.sequence
                    newfilename = check + "\\Source Docs\\" + src[0] + '.jpeg'
                    Image(images[i]).save(filename=newfilename)
                    print("Screenshot of " + src[0] + " saved")

Hello, what have you tried ? Have you reached a particular problem ? — , Apr 16 '19 at 07:35
@reportgunner I have tried the above libraries. My problem is that i am unable to extract coordinates of the highlighted text from pdf files. — Godfrey, Apr 16 '19 at 09:31
have a look at [this](https://stackoverflow.com/questions/22898145/how-to-extract-text-and-text-coordinates-from-a-pdf-file) and [this](https://stackoverflow.com/questions/25248140/how-does-one-obtain-the-location-of-text-in-a-pdf-with-pdfminer) — , Apr 16 '19 at 10:10

score 10 · Accepted Answer · answered Apr 16 '19 at 12:22

"couldn't find a way to get these coordinates as output" - you can get the coordinates out by doing this:

for inst in text_instances:
    print(inst)

inst are fitz.Rect objects which contain the top left and bottom right coordinates of the piece of text that was found. All the information is available in the docs.

I managed to highlight points and also save a cropped region using the following snippet of code. I am using python 3.7.1 and my output for fitz.version is ('1.14.13', '1.14.0', '20190407064320').

import fitz

doc = fitz.open("foo.pdf")
inst_counter = 0
for pi in range(doc.pageCount):
    page = doc[pi]

    text = "hello"
    text_instances = page.searchFor(text)

    five_percent_height = (page.rect.br.y - page.rect.tl.y)*0.05

    for inst in text_instances:
        inst_counter += 1
        highlight = page.addHighlightAnnot(inst)

        # define a suitable cropping box which spans the whole page 
        # and adds padding around the highlighted text
        tl_pt = fitz.Point(page.rect.tl.x, max(page.rect.tl.y, inst.tl.y - five_percent_height))
        br_pt = fitz.Point(page.rect.br.x, min(page.rect.br.y, inst.br.y + five_percent_height))
        hl_clip = fitz.Rect(tl_pt, br_pt)

        zoom_mat = fitz.Matrix(2, 2)
        pix = page.getPixmap(matrix=zoom_mat, clip = hl_clip)
        pix.writePNG(f"pg{pi}-hl{inst_counter}.png")

doc.close()

I tested this on a sample pdf that i peppered with "hello":

Some of the outputs from the script:

I composed the solution out of the following pages of the documentation:

Tutorial page to get introduced into the library
page.searchFor to figure out the return type of the searchFor method
fitz.Rect to understand what the returned objects from page.searchFor are
Collection of Recipes page (called faq in the URL) to figure out how to crop and save part of a pdf page

this solution is perfect. Also, thanks for sharing all these links. they will be very useful. — Godfrey, Apr 18 '19 at 04:13
can we use the same to extract the co-ordinates of pre-highlighted content in the pdf files? — Godfrey, May 08 '19 at 15:13
I don't know. The only time I used the library is to answer your question. Check out the docs; there maybe something to extract highlights in a page. — SpaceMonkey55, May 08 '19 at 15:22
This is one of the best Answers on Stackoverflow on PDF Highlighting using Python, thanks a ton @SpaceMonkey55 :-) — Vetrivel PS, Jun 06 '22 at 12:44
Some methods are being written with not camel case with shish kabab: `pix._writeIMG(f"pg{pi}-hl{inst_counter}.png", format=1)` — Suat Atan PhD, Jan 23 '23 at 13:55

Highlight text content in pdf files using python and save a screenshot

1 Answers1