6

I am trying to extract images in PDF with BBox coordinates of the image.

I tried using pdfrw library, it is identifying image objects and it have an attribute called media box which have some coordinates, i am not sure if those are correct bbox coordinates since for some pdfs it is showing something like this ['0', '0', '684', '864'] but image doesn't start at the start of the page, so i don't think it is bbox

I tried with following code using pdfrw

import pdfrw, os
from pdfrw import PdfReader, PdfWriter
from pdfrw.findobjs import page_per_xobj
outfn = 'extract.' + os.path.basename(path)
pages = list(page_per_xobj(PdfReader(path).pages, margin=0.5*72))
writer = PdfWriter(outfn)
writer.addpages(pages)
writer.write()

How do i get image along with it's bbox coordinates?

sample pdf : https://drive.google.com/open?id=1IVbj1b3JfmSv_BJvGUqYvAPVl3FwC2A-

Satyaaditya
  • 537
  • 8
  • 26

1 Answers1

11

I found a way to do it through a library called pdfplumber. It's built on top of pdfminer and is working consistently in my use-case. And moreover, its MIT licensed so it is helpful for my office work.

    import pdfplumber

    pdf_obj = pdfplumber.open(doc_path)
    page = pdf_obj.pages[page_no]
    images_in_page = page.images
    page_height = page.height
    image = images_in_page[0] # assuming images_in_page has at least one element, only for understanding purpose. 
    image_bbox = (image['x0'], page_height - image['y1'], image['x1'], page_height - image['y0'])
    cropped_page = page.crop(image_bbox)
    image_obj = cropped_page.to_image(resolution=400)
    image_obj.save(path_to_save_image)

    

Worked well for tables and images in my case.

Satyaaditya
  • 537
  • 8
  • 26
  • 1
    on your code the image_bbox should be inside a loop something like; for image in images_in_page: image_bbox = (image['x0'], page_height - image['y1'], image['x1'], page_height - image['y0']) – bherto39 May 12 '21 at 06:34
  • 1
    you are actually right, i thought of making it generic and missed that, thanks for correcting – Satyaaditya May 12 '21 at 07:13