0

I need to extract images from a pdf without losing its location in the pdf. I need to know which page the image is on and where in the text the image is located, and then save the text and images in the pdf to a json file with the sequence of the data intact.

1 Answers1

0

You could use pdfplumber and run this code:

import pdfplumber

pdf_obj = pdfplumber.open(doc_path)
page = pdf_obj.pages[page_no]
images_in_page = page.images
page_height = page.height
image_bbox = (image['x0'], page_height - image['y1'], image['x1'], page_height - image['y0'])
cropped_page = page.crop(image_bbox)
image_obj = cropped_page.to_image(resolution=400)
image_obj.save(path_to_save_image)

Source link here

Nicolas
  • 26
  • 2