
I have a list of PDF files, and I need to highlight specific text on each page of these files and save a snapshot of each text instance.

So far I am able to highlight the text and save the entire page of a PDF file as a snapshot. However, I want to find the position of the highlighted text and take a zoomed-in snapshot, which would be more detailed than the full-page snapshot.

I'm pretty sure there must be a solution to this problem. I am new to Python and hence I am not able to find it. I would be really grateful if someone can help me out with this.

I have tried the PyPDF2 and PyMuPDF libraries, but I couldn't figure out a solution. I also tried highlighting by providing coordinates, which works, but I couldn't find a way to get these coordinates as output.

[![Sample snapshot from the code][1]][1]

#import PyPDF2
import os
import fitz
from wand.image import Image
import csv
#import re
#from pdf2image import convert_from_path

check = r'C:\Users\Pradyumna.M\Desktop\Pradyumna\Automation\Intel Bytes\Create Source Docs\Sample Check 8 Apr 2019'

dir1 = check + '\\Source Docs\\'
dir2 = check + '\\Output\\'

dir = [dir1, dir2]

for x in dir:
    try:
        os.mkdir(x)
    except FileExistsError:
        print("Directory ", x, " already exists")

### READ PDF FILE
with open('upload1.csv', newline='') as myfile:
    reader = csv.reader(myfile)
    for row in reader:
        rowarray = '; '.join(row)
        src = rowarray.split("; ")
        file = check + '\\' + src[4] + '.pdf'
        print(file)
        #pdfFileObj = open(file,'rb')
        #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        #print("Total number of pages: " + str(pdfReader.numPages))
        doc = fitz.open(file)
        print(src[5])
        for i in range(int(src[5])-1, int(src[5])):
            i = int(i)
            page = doc[i]
            print("Processing page: " + str(i))
            text = src[3]
            #SEARCH TEXT
            print("Searching: " + text)
            text_instances = page.searchFor(text)
            for inst in text_instances:
                highlight = page.addHighlightAnnot(inst)
                file1 = check + '\\Output\\' + src[4] + '_output.pdf'
                print(file1)
                doc.save(file1, garbage=4, deflate=True, clean=True)
                ### Screenshot
                with(Image(filename=file1, resolution=150)) as source:
                    images = source.sequence
                    newfilename = check + "\\Source Docs\\" + src[0] + '.jpeg'
                    Image(images[i]).save(filename=newfilename)
                    print("Screenshot of " + src[0] + " saved")
Masoud Rahimi
Godfrey
  • Hello, what have you tried? Have you run into a particular problem? –  Apr 16 '19 at 07:35
  • @reportgunner I have tried the above libraries. My problem is that I am unable to extract the coordinates of the highlighted text from PDF files. – Godfrey Apr 16 '19 at 09:31
  • Have a look at [this](https://stackoverflow.com/questions/22898145/how-to-extract-text-and-text-coordinates-from-a-pdf-file) and [this](https://stackoverflow.com/questions/25248140/how-does-one-obtain-the-location-of-text-in-a-pdf-with-pdfminer) –  Apr 16 '19 at 10:10
  • @reportgunner Thanks for the links. Really appreciate it. – Godfrey Apr 18 '19 at 04:14

1 Answer


"couldn't find a way to get these coordinates as output" - you can get the coordinates out by doing this:

for inst in text_instances:
    print(inst)

Each inst is a fitz.Rect object, which contains the top-left and bottom-right coordinates of the piece of text that was found. All the information is available in the docs.
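To illustrate what those four numbers mean, here is a minimal sketch that does not need PyMuPDF installed. It uses a plain namedtuple as a stand-in for fitz.Rect; the field names x0, y0, x1, y1 mirror the attributes of fitz.Rect, while the `describe` helper and the sample values are hypothetical. Coordinates are in points (1/72 inch), with the origin at the top-left of the page and y growing downward.

```python
from collections import namedtuple

# Stand-in for fitz.Rect (assumption: used only so the sketch runs without PyMuPDF).
Rect = namedtuple("Rect", ["x0", "y0", "x1", "y1"])


def describe(rect):
    """Return the corner points and the size of a rectangle."""
    return {
        "top_left": (rect.x0, rect.y0),
        "bottom_right": (rect.x1, rect.y1),
        "width": rect.x1 - rect.x0,
        "height": rect.y1 - rect.y0,
    }


# Hypothetical search hit: a word 60 points wide and 12 points tall.
inst = Rect(72.0, 100.0, 132.0, 112.0)
info = describe(inst)
print(info)
```

With a real search result, the same values should be readable directly from the rectangle, for example via inst.x0 or inst.tl.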

I managed to highlight the text and also save a cropped region using the following snippet of code. I am using Python 3.7.1, and my output for fitz.version is ('1.14.13', '1.14.0', '20190407064320').

import fitz

doc = fitz.open("foo.pdf")
inst_counter = 0
for pi in range(doc.pageCount):
    page = doc[pi]

    text = "hello"
    text_instances = page.searchFor(text)

    five_percent_height = (page.rect.br.y - page.rect.tl.y)*0.05

    for inst in text_instances:
        inst_counter += 1
        highlight = page.addHighlightAnnot(inst)

        # define a cropping box that spans the full page width and adds
        # vertical padding (5% of the page height) around the highlighted text
        tl_pt = fitz.Point(page.rect.tl.x, max(page.rect.tl.y, inst.tl.y - five_percent_height))
        br_pt = fitz.Point(page.rect.br.x, min(page.rect.br.y, inst.br.y + five_percent_height))
        hl_clip = fitz.Rect(tl_pt, br_pt)

        zoom_mat = fitz.Matrix(2, 2)
        pix = page.getPixmap(matrix=zoom_mat, clip = hl_clip)
        pix.writePNG(f"pg{pi}-hl{inst_counter}.png")

doc.close()

I tested the script on a sample PDF that I peppered with "hello": Input image

Some of the outputs from the script: pg2-hello1 pg2-hello5
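The script crops a strip across the full page width. If a tighter, more zoomed-in snapshot around each hit is wanted, the clip can instead be padded on all four sides and clamped to the page. The sketch below shows only that arithmetic on plain (x0, y0, x1, y1) tuples, so it runs without PyMuPDF; `padded_clip`, `pixel_size`, the page size, and the padding value are all hypothetical names and numbers chosen for illustration. The pixel size is an estimate, since the renderer may round differently.

```python
def padded_clip(hit, page, pad):
    """Grow `hit` by `pad` points on every side, clamped to the page bounds."""
    hx0, hy0, hx1, hy1 = hit
    px0, py0, px1, py1 = page
    return (
        max(px0, hx0 - pad),
        max(py0, hy0 - pad),
        min(px1, hx1 + pad),
        min(py1, hy1 + pad),
    )


def pixel_size(clip, zoom):
    """Estimate the rendered size in pixels for a clip scaled by `zoom`."""
    x0, y0, x1, y1 = clip
    return (round((x1 - x0) * zoom), round((y1 - y0) * zoom))


page = (0.0, 0.0, 595.0, 842.0)    # assumed A4-sized page, in points
hit = (72.0, 100.0, 132.0, 112.0)  # hypothetical search hit

clip = padded_clip(hit, page, pad=20.0)
size = pixel_size(clip, zoom=2)
corner_clip = padded_clip((5.0, 5.0, 30.0, 15.0), page, pad=20.0)

print(clip, size, corner_clip)
```

The resulting tuple can be turned into a rectangle with fitz.Rect(*clip) and passed as the clip argument in place of hl_clip. A larger zoom factor in fitz.Matrix gives a more detailed image of the same region.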

I composed the solution out of the following pages of the documentation:

  • Tutorial page to get introduced into the library
  • page.searchFor to figure out the return type of the searchFor method
  • fitz.Rect to understand what the returned objects from page.searchFor are
  • Collection of Recipes page (called faq in the URL) to figure out how to crop and save part of a pdf page
SpaceMonkey55