I'm trying to crop the text boxes out of a PDF that contains text. This will be very useful for gathering training data for one of my models. Here's a sample PDF: https://github.com/tomasmarcos/tomrep/blob/tomasmarcos-example2delete/example%20-%20Git%20From%20Bottom%20Up.pdf ; for example, I would like to get the first text box on the page as an image (JPG or any other format).

What I've tried so far is the code below, but I'm open to other approaches if you have one. Part I is a modified version of a solution (the first answer) that I found in "How to extract text and text coordinates from a PDF file?". Part II is my attempt at the cropping, which hasn't worked so far. I also tried reading the image with PyMuPDF, but that didn't change anything (I won't post that attempt since the post is long enough).

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io

# pdf path 
pdf_path ="example - Git From Bottom Up.pdf"

# PART 1: GET LTBOXES COORDINATES IN THE IMAGE
# Open a PDF file.
fp = open(pdf_path, 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)


# here is where i stored the data
boxes_data = []
page_sizes = []

def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {"startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),"endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),"text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)

Part II of the code, which should produce the cropped box as an image:

# PART 2: NOW GET PAGE TO IMAGE
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path,size=(firstpage_size["height"],firstpage_size["width"]))[0]
#show first page with the right size (at least the one that pdfminer says)
firstpage_image.show()

#first box data
startX,startY,endX,endY,text = boxes_data[0].values()
# turn image to array
image_array = np.array(firstpage_image)
# get cropped box
box = image_array[startY:endY,startX:endX,:]
convert2pil_image = PIL.Image.fromarray(box)
#show cropped box image
convert2pil_image.show()
#print this does not match with the text, means there's an error
print(text)

As you can see, the box coordinates do not match the image. Maybe pdf2image is doing something with the image size, but I specified the size explicitly, so I don't know. Any solutions or suggestions are more than welcome. Thanks in advance.

Tom

1 Answer

I've checked the coordinates of the first two boxes from the first part of your code, and they more or less fit the text on the page.

But are you aware that the zero point (origin) in a PDF is in the bottom-left corner? That may be the cause of the problem.

Unfortunately, I didn't manage to test the second part of the code; pdf2image gives me an error.

But I'm almost sure that PIL.Image has its zero point in the top-left corner, unlike PDF. You can convert pdf_Y to pil_Y with the formula:

pil_Y = page_height - pdf_Y

The page height in your case is 792 pt, and you can get the page height from the script as well.
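Here is a minimal sketch of that formula applied to a whole bounding box rather than a single coordinate. It assumes the box comes as the (x0, y0, x1, y1) tuple in points that pdfminer reports. The name `flip_bbox` is hypothetical and not part of any library. Note that the top and bottom edges swap roles after the flip.

```python
def flip_bbox(bbox, page_height):
    """Convert a PDF bbox (bottom-left origin) to a top-left-origin box.

    bbox is (x0, y0, x1, y1) in points, where y0 is the bottom edge and
    y1 is the top edge. Returns (left, top, right, bottom).
    """
    x0, y0, x1, y1 = bbox
    top = page_height - y1      # the PDF top edge becomes the smaller y
    bottom = page_height - y0   # the PDF bottom edge becomes the larger y
    return (x0, top, x1, bottom)


# Hypothetical box near the top of a 792 pt page.
print(flip_bbox((72, 700, 300, 720), 792))  # -> (72, 72, 300, 92)
```

The result is still in points; it is not yet scaled to the pixel grid of a rendered image.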

[image: Coordinates]


Update

Nevertheless, after a couple of hours spent installing all the modules (that was the hardest part!), I got your script to work to some extent.

Basically I was right: the coordinates were inverted (y => h - y) because PIL and PDF put the zero point in different corners.

And there was another thing. pdf2image renders pages at 200 dpi by default (the `dpi` argument of convert_from_path() controls this). PDF measures everything in points (1 pt = 1/72 inch). So if you want to use PDF sizes on the rendered image, you need to scale them this way: x => x * 200 / 72.
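As a quick check of that scaling, here is a small sketch. It assumes a 200 dpi render and a US Letter page of 612 x 792 pt; the width is an assumption, as only the 792 pt height is mentioned above. The helper name `points_to_pixels` is made up for illustration.

```python
POINTS_PER_INCH = 72  # PDF user space unit: 1 pt = 1/72 inch


def points_to_pixels(value_pt, dpi=200):
    """Scale a length in PDF points to pixels at the given render dpi."""
    return int(value_pt * dpi / POINTS_PER_INCH)


# Assumed US Letter page size in points.
print(points_to_pixels(612), points_to_pixels(792))  # -> 1700 2200
```

If the rendered image has a different size than this predicts, the render dpi is probably not 200, and the scale factor has to change with it.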

Here is the fixed code:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer
import os
import pandas as pd
import pdf2image
import numpy as np
import PIL
from PIL import Image
import io
from pathlib import Path # it's just my favorite way to handle files

# pdf path
# pdf_path ="test.pdf"
pdf_path = Path.cwd()/"Git From Bottom Up.pdf"


# PART 1: GET LTBOXES COORDINATES IN THE IMAGE ----------------------
# Open a PDF file.
fp = open(pdf_path, 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)


# here is where i stored the data
boxes_data = []
page_sizes = []

def parse_obj(lt_objs, verbose = 0):
    # loop over the object list
    for obj in lt_objs:
        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            if verbose >0:
                print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text()))
            data_dict = {
                "startX":round(obj.bbox[0]),"startY":round(obj.bbox[1]),
                "endX":round(obj.bbox[2]),"endY":round(obj.bbox[3]),
                "text":obj.get_text()}
            boxes_data.append(data_dict)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):
    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()
    # extract text from this object
    parse_obj(layout._objs)
    mediabox = page.mediabox
    mediabox_data = {"height":mediabox[-1], "width":mediabox[-2]}
    page_sizes.append(mediabox_data)

# PART 2: NOW GET PAGE TO IMAGE -------------------------------------
firstpage_size = page_sizes[0]
firstpage_image = pdf2image.convert_from_path(pdf_path)[0] # without 'size=...'
#show first page with the right size (at least the one that pdfminer says)
# firstpage_image.show()
firstpage_image.save("firstpage.png")

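# the magic numbers
# pixels per point: pdf2image renders at 200 dpi by default, 1 pt = 1/72 inch
dpi = 200/72
vertical_shift = 5 # empirical: the crops need a small vertical shift, reason unknown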
page_height = int(firstpage_size["height"] * dpi)

# loop through boxes (we'll process only first page for now)
for i, _ in enumerate(boxes_data):

    #first box data
    startX, startY, endX, endY, text = boxes_data[i].values()

    # correction PDF --> PIL
    startY = page_height - int(startY * dpi) - vertical_shift
    endY   = page_height - int(endY   * dpi) - vertical_shift
    startX = int(startX * dpi)
    endX   = int(endX   * dpi)
    startY, endY = endY, startY 

    # turn image to array
    image_array = np.array(firstpage_image)
    # get cropped box
    box = image_array[startY:endY,startX:endX,:]
    convert2pil_image = PIL.Image.fromarray(box)
    #show cropped box image
    # convert2pil_image.show()
    png = "crop_" + str(i) + ".png"
    convert2pil_image.save(png)
    # print the box text so it can be compared with the cropped image
    print(text)

The code is almost the same as yours. I just added the coordinate correction, and I save PNG files rather than show them.

Output:

Gi from the bottom up

Wed,  Dec 9

by John Wiegley

In my pursuit to understand Git, it’s been helpful for me to understand it from the bottom
up — rather than look at it only in terms of its high-level commands. And since Git is so beauti-
fully simple when viewed this way, I thought others might be interested to read what I’ve found,
and perhaps avoid the pain I went through nding it.

I used Git version 1.5.4.5 for each of the examples found in this document.

1.  License
2.  Introduction
3.  Repository: Directory content tracking

Introducing the blob
Blobs are stored in trees
How trees are made
e beauty of commits
A commit by any other name…
Branching and the power of rebase
4.  e Index: Meet the middle man

Taking the index farther
5.  To reset, or not to reset

Doing a mixed reset
Doing a so reset
Doing a hard reset

6.  Last links in the chain: Stashing and the reog
7.  Conclusion
8.  Further reading

2
3
5
6
7
8
10
12
15
20
22
24
24
24
25
27
30
31

Of course, the fixed code is more of a prototype. Not for sale. :)

Yuri Khristich
  • Thanks for the answer and checking that the first part of the code works; it is very useful for me even if it is not the final answer. What do you mean by if I'm aware that zero point in PDF placed in bottom-left corner? – Tom Jun 16 '21 at 14:54
  • I just added one more picture to my answer – Yuri Khristich Jun 16 '21 at 15:16
  • @YuriKhristich, thanks for the clear explanation. I checked that with the following code and it does not seem to be the problem. I used it to make sure the first 300 pixels match the top left of the image: `image_array = np.array(pil_image)`, `box = image_array[0:300,0:300,:]`, `convert2pil_image = PIL.Image.fromarray(box)`, `convert2pil_image.show()` – Tom Jun 16 '21 at 16:22
  • Just to be sure, try adding after the line `startX,startY,endX,endY,text = boxes_data[0].values()` these lines: `startY = 792-startY` and `endY = 792-endY` – Yuri Khristich Jun 16 '21 at 17:08
  • I'll try to remake the second part of the code from scratch. If the coordinates are correct, the task of cutting several images out of a PDF page looks rather trivial. – Yuri Khristich Jun 16 '21 at 17:41
  • It works perfectly for many of my PDFs, so it generalizes to other images. Nice!! By the way, dpi = 200 is the default in pdf2image.convert_from_path(); that's why you had to adjust for it. About the shift, I'll do some research and let you know, but this is great progress. Thanks!! – Tom Jun 17 '21 at 13:22
  • Good luck! It probably makes sense to expand these box coordinates a bit, by about 5 pixels on each side I think. That will keep the images from being cropped too tightly. – Yuri Khristich Jun 17 '21 at 13:58
  • Hahah sure I will do that! Thank you so much for your help!! – Tom Jun 17 '21 at 14:02
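
The padding suggested in the last comments could be sketched roughly as follows. It assumes the box is already in pixel coordinates with a top-left origin, as (left, top, right, bottom). The function name `pad_box` and the example sizes are hypothetical. Clamping to the image bounds keeps a box near the page edge from producing negative or out-of-range indices.

```python
def pad_box(box, pad, image_width, image_height):
    """Expand a pixel box by `pad` on every side, clamped to the image."""
    left, top, right, bottom = box
    return (
        max(left - pad, 0),
        max(top - pad, 0),
        min(right + pad, image_width),
        min(bottom + pad, image_height),
    )


# Hypothetical box close to the left edge of a 1700 x 2200 px page image.
print(pad_box((2, 100, 500, 140), 5, 1700, 2200))  # -> (0, 95, 505, 145)
```

The padded tuple can then be used for the array slice, with rows taken from top to bottom and columns from left to right.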