I'm trying to crop text boxes out of a PDF that contains text. This would be very useful for gathering training data for one of my models. Here's a sample PDF: https://github.com/tomasmarcos/tomrep/blob/tomasmarcos-example2delete/example%20-%20Git%20From%20Bottom%20Up.pdf ; for example, I would like to get the first text box on the page as an image (jpg or any other format).
What I have tried so far is the code below, but I'm open to solving this in other ways, so if you have another approach, that's fine too. Part I is a modified version of a solution (the first answer) that I found in "How to extract text and text coordinates from a PDF file?". Part II is what I tried on top of that, but it hasn't worked so far. I also tried reading the image with pymupdf, but that didn't change anything (I won't post that attempt since this post is long enough).
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io
# pdf path
pdf_path ="example - Git From Bottom Up.pdf"
# PART 1: GET LTBOXES COORDINATES IN THE IMAGE
# Open a PDF file.
fp = open(pdf_path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# here is where i stored the data
boxes_data = []
page_sizes = []
def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {"startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),"endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),"text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)
# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    # store the page size reported by pdfminer (mediabox is [x0, y0, x1, y1])
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)
Part II of the code, getting the cropped box in image format.
# PART 2: NOW GET PAGE TO IMAGE
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path,size=(firstpage_size["height"],firstpage_size["width"]))[0]
#show first page with the right size (at least the one that pdfminer says)
firstpage_image.show()
#first box data
startX,startY,endX,endY,text = boxes_data[0].values()
# turn image to array
image_array = np.array(firstpage_image)
# get cropped box
box = image_array[startY:endY,startX:endX,:]
convert2pil_image = PIL.Image.fromarray(box)
#show cropped box image
convert2pil_image.show()
# print the text of the first box; the cropped image does not show this text, so something is wrong
print(text)
As you see, the coordinates of the box do not match the image. Maybe the problem is that pdf2image is doing something with the image size, but as far as I can tell I specified the size of the image correctly, so I don't know. Any solutions or suggestions are more than welcome. Thanks in advance.
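One thing that may be relevant, offered as a sketch under assumptions rather than a confirmed fix: pdfminer reports `bbox` as `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left of the page, while a NumPy image array is indexed from the top-left. In addition, the `size` argument of pdf2image follows the Pillow `(width, height)` order. The helper below (`pdf_bbox_to_pixel_box` is a hypothetical name, not part of any library) shows the kind of conversion that would be needed. It assumes the mediabox starts at `(0, 0)` and that `scale` is the number of pixels per PDF point (1.0 if the page is rendered at exactly the mediabox size, `dpi / 72` if rendered at a given dpi).

```python
def pdf_bbox_to_pixel_box(bbox, page_height, scale=1.0):
    """Convert a PDF bbox (origin bottom-left, in points) to an image box
    (origin top-left, in pixels), returned as (left, top, right, bottom).

    Assumptions: the page's mediabox starts at (0, 0), and `scale` is the
    number of rendered pixels per PDF point.
    """
    x0, y0, x1, y1 = bbox
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the y axis: the top edge of the box in PDF space (y1) is the
    # smaller row index in image space.
    top = round((page_height - y1) * scale)
    bottom = round((page_height - y0) * scale)
    return left, top, right, bottom


# Example with made-up numbers: a 612 x 792 pt page rendered at 1 pixel per point.
box = pdf_bbox_to_pixel_box((72, 700, 540, 720), page_height=792)
print(box)  # (72, 72, 540, 92)
```

With a box like this, the crop in Part II would become `image_array[top:bottom, left:right, :]` (or `firstpage_image.crop(box)`, since Pillow's `crop` takes `(left, upper, right, lower)`), with the page rendered using `size=(width, height)`.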