I am trying to extract images in PDF with BBox coordinates of the image.
I tried using pdfrw library, it is identifying image objects and it have an attribute called media box which have some coordinates, i am not sure if those are correct bbox coordinates since for some pdfs it is showing something like this ['0', '0', '684', '864'] but image doesn't start at the start of the page, so i don't think it is bbox
I tried with following code using pdfrw
import pdfrw, os
from pdfrw import PdfReader, PdfWriter
from pdfrw.findobjs import page_per_xobj
outfn = 'extract.' + os.path.basename(path)
pages = list(page_per_xobj(PdfReader(path).pages, margin=0.5*72))
writer = PdfWriter(outfn)
writer.addpages(pages)
writer.write()
How do i get image along with it's bbox coordinates?
sample pdf : https://drive.google.com/open?id=1IVbj1b3JfmSv_BJvGUqYvAPVl3FwC2A-