I want to extract images from PDFs while retaining knowledge of their context (page number and coordinates on the page). Some tools (e.g. pdfminer) only emit image files with non-semantic names, e.g. Img0.bmp. I can do this with PDFBox (Java), but I'd ideally like a Python tool.
My current (arbitrary) design is to create filenames of the form:
image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png
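To make the naming scheme concrete, here is a minimal sketch of how such a filename could be built and parsed back. The function names `image_filename` and `parse_image_filename` are hypothetical, and rounding the coordinates to whole units is my own assumption (PDF coordinates are usually floats):

```python
import re


def image_filename(page, serial, x1, x2, y1, y2):
    """Build image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png.

    Assumption: coordinates are rounded to integers to keep names short.
    """
    x1, x2, y1, y2 = (round(v) for v in (x1, x2, y1, y2))
    return f"image_{page}_{serial}_{x1}_{x2}__{y1}_{y2}.png"


# Matches the pattern produced above; coordinates may be negative.
_NAME_RE = re.compile(
    r"image_(\d+)_(\d+)_(-?\d+)_(-?\d+)__(-?\d+)_(-?\d+)\.png$"
)


def parse_image_filename(name):
    """Recover (page, serial, x1, x2, y1, y2) from a filename, or None."""
    match = _NAME_RE.match(name)
    return tuple(int(g) for g in match.groups()) if match else None


name = image_filename(3, 0, 72.4, 300.6, 100.2, 250.0)
print(name)
print(parse_image_filename(name))
```

Being able to parse the name back means the page and bounding box can be recovered later from the file alone.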
Currently pdfplumber exposes the coordinates, but with a PDFStream and encoding information rather than an image. Code to convert the stream to a *.png would solve the problem.
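To show the kind of conversion I mean, here is a rough sketch that writes raw pixel samples as a PNG using only the standard library. It assumes the stream has already been decoded to plain 8-bit gray, RGB, or RGBA samples (I believe pdfminer.six's `PDFStream.get_data()` does this for Flate-encoded streams, but I have not verified every case) and that the width and height are known. It does not handle JPEG (DCTDecode) data, palettes, CMYK, predictors, or soft masks. `raw_to_png` is a hypothetical helper name:

```python
import struct
import zlib


def _chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, data, CRC-32 over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def raw_to_png(samples: bytes, width: int, height: int, channels: int = 3) -> bytes:
    """Encode raw 8-bit samples (1 = gray, 3 = RGB, 4 = RGBA) as PNG bytes."""
    color_type = {1: 0, 3: 2, 4: 6}[channels]  # PNG colour-type codes
    stride = width * channels
    if len(samples) != stride * height:
        raise ValueError("sample count does not match width * height * channels")
    # Each scanline is prefixed with filter type 0 (no filtering).
    rows = b"".join(
        b"\x00" + samples[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type, then three zero fields.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )


# Usage: a 2x1 image with one red and one green pixel.
png_bytes = raw_to_png(bytes([255, 0, 0, 0, 255, 0]), width=2, height=1)
with open("image_1_0_0_2__0_1.png", "wb") as fh:
    fh.write(png_bytes)
```

What I am missing is the general step before this: getting from the PDFStream and its encoding information to correctly decoded samples for every filter and colour space.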
(NOTE: the pdfplumber approach of rendering to the screen and capturing the known rectangle (which I use) is not a solution, as the image is often degraded and frequently overwritten with text.)
(NOTE: I have had problems with several Python tools (pdfminer.six, PyMuPDF) extracting images, as they make the background black, which obscures black text, etc. PDFBox (Java) doesn't have this problem.)