Is there a way to extract images from a pdf in Python while preserving the location of the image in the pdf?

Question

I need to extract images from a PDF without losing their locations in the PDF. I need to know which page each image is on and where it sits relative to the text, and then save the PDF's text and images to a JSON file with the original sequence of the content intact.

Does this answer your question? [Extract images from PDF without resampling, in python?](https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python) — user_na, Jun 23 '21 at 18:38
No, I do care about where the source image is located on the page :( — Manasi Gowda, Jun 23 '21 at 18:46

score 0 · Answer 1 · answered Jun 23 '21 at 18:47

You could use pdfplumber and run this code:

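import pdfplumber

# doc_path, page_no and path_to_save_image are placeholders you supply
pdf_obj = pdfplumber.open(doc_path)
page = pdf_obj.pages[page_no]  # zero-based index into the list of pages
images_in_page = page.images  # one dict per image, including its coordinates
page_height = page.height
for i, image in enumerate(images_in_page):
    # y0/y1 are measured from the bottom of the page; crop() expects distances from the top
    image_bbox = (image['x0'], page_height - image['y1'], image['x1'], page_height - image['y0'])
    cropped_page = page.crop(image_bbox)
    image_obj = cropped_page.to_image(resolution=400)
    # the index suffix keeps later images from overwriting earlier ones
    image_obj.save(f"{path_to_save_image}_{i}.png")
pdf_obj.close()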

Source link here
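
To also get the text and images into one JSON file in page order, something along these lines might work. It is only a sketch: the helper names (`group_words_into_lines`, `merge_page_items`, `extract_pdf_to_json`), the JSON layout, and the 3-point line tolerance are my own assumptions, not part of pdfplumber. It relies on `page.extract_words()` and on the `top`/`bottom` keys that pdfplumber attaches to each image, assumes every image lies fully inside its page (otherwise `crop()` may raise an error), and assumes a simple single-column layout where sorting by vertical position gives the reading order. Depending on your pdfplumber version, `to_image()` may need an extra rendering backend installed.

```python
import json


def group_words_into_lines(words, tolerance=3.0):
    """Group word dicts (keys: text, x0, top) into lines by vertical position."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # A word joins the current line if its top is close to the line's top.
        if lines and abs(word["top"] - lines[-1]["top"]) <= tolerance:
            lines[-1]["words"].append(word)
        else:
            lines.append({"top": word["top"], "words": [word]})
    result = []
    for line in lines:
        ordered = sorted(line["words"], key=lambda w: w["x0"])
        result.append({
            "top": line["top"],
            "x0": ordered[0]["x0"],
            "text": " ".join(w["text"] for w in ordered),
        })
    return result


def merge_page_items(page_number, lines, images):
    """Interleave text lines and images of one page, top to bottom."""
    items = []
    for line in lines:
        items.append({
            "type": "text",
            "page": page_number,
            "top": float(line["top"]),
            "x0": float(line["x0"]),
            "text": line["text"],
        })
    for image in images:
        items.append({
            "type": "image",
            "page": page_number,
            "top": float(image["top"]),
            "x0": float(image["x0"]),
            # bbox is (x0, top, x1, bottom), measured from the top-left corner
            "bbox": [float(image[k]) for k in ("x0", "top", "x1", "bottom")],
            "file": image.get("file"),
        })
    return sorted(items, key=lambda item: (item["top"], item["x0"]))


def extract_pdf_to_json(pdf_path, json_path, image_dir="."):
    """Write the text lines and image locations of a PDF to json_path, in order."""
    import os
    import pdfplumber  # third-party; imported here so the helpers work without it

    document = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            lines = group_words_into_lines(page.extract_words())
            images = []
            for index, image in enumerate(page.images):
                bbox = (image["x0"], image["top"], image["x1"], image["bottom"])
                file_name = os.path.join(
                    image_dir, f"page{page.page_number}_img{index}.png"
                )
                # Renders the cropped region; this is not the raw embedded image.
                page.crop(bbox).to_image(resolution=150).save(file_name)
                images.append({
                    "x0": image["x0"], "top": image["top"],
                    "x1": image["x1"], "bottom": image["bottom"],
                    "file": file_name,
                })
            document.extend(merge_page_items(page.page_number, lines, images))
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(document, handle, indent=2)
    return document
```

Calling `extract_pdf_to_json("input.pdf", "output.json")` (both file names are placeholders) should give a flat list where each entry carries its page number and position, so an image entry appears between the text lines that sit above and below it on the page.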
