I need to extract images from a PDF without losing their location in the PDF. I need to know which page each image is on and where in the text it appears, and then save the PDF's text and images to a JSON file with the original sequence of the content intact.
- What have you already tried? – user_na Jun 23 '21 at 18:35
- Does this answer your question? [Extract images from PDF without resampling, in python?](https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python) – user_na Jun 23 '21 at 18:38
- No, I do care about where the source image is located on the page :( – Manasi Gowda Jun 23 '21 at 18:46
1 Answer
You could use pdfplumber and run this code:
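import pdfplumber

# doc_path, page_no and path_to_save_image are placeholders you need to define
pdf_obj = pdfplumber.open(doc_path)
page = pdf_obj.pages[page_no]  # pages is zero-indexed
images_in_page = page.images  # list of dicts, one per embedded image
image = images_in_page[0]  # first image on the page; loop over the list to handle all of them
page_height = page.height
# y0/y1 are measured from the bottom of the page, crop() expects distances from the top
image_bbox = (image['x0'], page_height - image['y1'], image['x1'], page_height - image['y0'])
cropped_page = page.crop(image_bbox)
image_obj = cropped_page.to_image(resolution=400)  # renders the cropped region at 400 DPI
image_obj.save(path_to_save_image)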
Source link here
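To keep the sequence of text and images for the JSON file, one option is to sort everything on a page by its vertical position. The following is only a rough sketch, not something taken from the pdfplumber docs: `merge_in_reading_order` and the `page{n}_img{k}.png` file naming are made-up names for illustration, and the sample dicts are hand-written stand-ins for what `page.extract_words()` and `page.images` return (both provide `x0`, `x1`, `top` and `bottom` keys). It assumes a single-column layout; multi-column pages would need a smarter ordering.

```python
import json


def merge_in_reading_order(page_number, words, images):
    """Interleave text and image entries of one page, top to bottom.

    `words` and `images` are lists of dicts shaped like the ones pdfplumber
    returns from page.extract_words() and page.images. Only the keys
    'x0', 'x1', 'top', 'bottom' (plus 'text' for words) are used.
    """
    items = []
    for w in words:
        items.append({
            "type": "text",
            "page": page_number,
            "text": w["text"],
            # float() keeps the values JSON-serializable even if they are Decimals
            "bbox": [float(w["x0"]), float(w["top"]), float(w["x1"]), float(w["bottom"])],
        })
    for n, img in enumerate(images):
        items.append({
            "type": "image",
            "page": page_number,
            "file": f"page{page_number}_img{n}.png",  # hypothetical naming scheme
            "bbox": [float(img["x0"]), float(img["top"]), float(img["x1"]), float(img["bottom"])],
        })
    # Sort by top edge, then left edge: a simple single-column reading order.
    items.sort(key=lambda it: (it["bbox"][1], it["bbox"][0]))
    return items


# Hand-written sample data standing in for one page's extraction results.
sample_words = [
    {"text": "Intro", "x0": 72.0, "x1": 100.0, "top": 80.0, "bottom": 92.0},
    {"text": "Conclusion", "x0": 72.0, "x1": 130.0, "top": 400.0, "bottom": 412.0},
]
sample_images = [{"x0": 72.0, "x1": 300.0, "top": 120.0, "bottom": 350.0}]

sequence = merge_in_reading_order(1, sample_words, sample_images)
print(json.dumps(sequence, indent=2))
```

With a real PDF you would call this once per page, passing `page.extract_words()` and `page.images`, and save each cropped image under the same file name that ends up in the JSON entry.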

Nicolas