extracting images from PDF with page and screen coordinate information

Question

I want to extract images from PDFs while retaining knowledge of their context (page number and coordinates on the page). Some tools (e.g. pdfminer) only emit image files with non-semantic names such as Img0.bmp. I can do this with PDFBox (Java), but I'd ideally like a Python tool.

My current (arbitrary) design is to create filenames of the form:

image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png
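As a minimal sketch of this naming scheme in Python (the function names `image_filename` and `parse_image_filename` are hypothetical, and rounding the coordinates to whole units is an assumption, not part of the scheme as stated), the filename could be built and parsed back like this:

```python
import re

# Pattern mirroring image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png
_NAME_RE = re.compile(
    r"^image_(\d+)_(\d+)_(-?\d+)_(-?\d+)__(-?\d+)_(-?\d+)\.png$"
)


def image_filename(page, serial, x1, x2, y1, y2):
    """Build a filename that encodes page, serial number and bounding box.

    Coordinates are rounded to integers (an assumption made to keep names short).
    """
    x1, x2, y1, y2 = (round(v) for v in (x1, x2, y1, y2))
    return f"image_{page}_{serial}_{x1}_{x2}__{y1}_{y2}.png"


def parse_image_filename(name):
    """Recover the encoded values from a filename, or return None if it does not match."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    page, serial, x1, x2, y1, y2 = (int(g) for g in match.groups())
    return {"page": page, "serial": serial, "x": (x1, x2), "y": (y1, y2)}


name = image_filename(11, 1, 102.3, 494.0, 183.2, 406.4)
print(name)
print(parse_image_filename(name))
```

Keeping a parser next to the builder means that later processing steps can recover the page and position from the filename alone.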

Currently pdfplumber exposes coordinates, but with a PDFStream and encoding information rather than an image. Code to convert the stream to a *.png would solve the problem.

(NOTE: the pdfplumber approach of rendering to the screen and capturing the known rectangle (which I use) is not a solution as the image is often degraded and frequently overwritten with text.)

(NOTE: I have had problems with several Python tools (pdfminer.six, PyMuPDF) when extracting images, as they make the background black, which obscures black text, etc. PDFBox (Java) doesn't have this problem.)

Thanks. Yes, it can be complex and arbitrary. I can join banded images if the order or coordinates are available. I am certainly prepared to try other Python tools and use them to extract to disk and then process the result. — peter.murray.rust, Jul 11 '22 at 17:14
Very useful to know how complex this can be. How common are these cases? Maybe I can ignore the times they occur in practice. In PDFBox (Java) this image extraction was "straightforward" as there was a bitstream with coordinates. (Maybe this should go in an answer?) — peter.murray.rust, Jul 11 '22 at 23:12
Thanks very much @K J. For Fig 6.2 on p11 (6-9) of https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter06/fulltext.pdf , pdfminer gives https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter06/images/Im0.0.bmp (wrong colormap and background), while PDFBox gives https://github.com/petermr/semanticClimate/blob/main/ipcc/ar6/wg3/Chapter06/pdfimages/image.11.1.102_494.183_406.png (correct). Any pointers really appreciated. (Could upvote an answer if it helps) — peter.murray.rust, Jul 16 '22 at 11:24
@peter.murray.rust FYI: The script in my answer produces the result that PDFBox gives, see https://transfer.sh/YKUXVM/image.png — gettalong, Jul 16 '22 at 19:42
@KJ That's all done by HexaPDF itself. Support for extracting the combined image is rather recent, added in March this year. — gettalong, Jul 16 '22 at 20:49
@KJ I strive to avoid dependencies if not really needed. And since PNG is a well-documented, open format, it was not that hard to implement it, see https://github.com/gettalong/hexapdf/commit/3c8ab6be69791c5c8b457032bc73cd5950bb9afc. It also helps that parts of the PDF spec depend on the PNG spec (e.g. the Deflate filter with a predictor), so one already needs to implement some parts for basic PDF operations. — gettalong, Jul 16 '22 at 21:42

score 1 · Answer 1 · answered Jul 16 '22 at 19:33

I don't have a solution in Python but here is a small script using Ruby and HexaPDF:

require 'hexapdf'

class ImageBorderProcessor < HexaPDF::Content::Processor

  def initialize(page, index)
    super()
    @page = page
    @index = index
    @count = 0
  end

  def paint_xobject(name)
    super
    xobject = resources.xobject(name)
    return unless xobject[:Subtype] == :Image
    w, h = xobject.width, xobject.height
    llx, lly = graphics_state.ctm.evaluate(0, 0)
    lrx, lry = graphics_state.ctm.evaluate(1, 0)
    urx, ury = graphics_state.ctm.evaluate(1, 1)
    ulx, uly = graphics_state.ctm.evaluate(0, 1)
    # If the image is rotated, you will need all 4 coordinates, not just the 2
    filename = "image_#{@index}_#{@count}_#{llx}_#{urx}_#{lly}_#{ury}"
    xobject.write(filename) rescue puts "Can't write image #{@index}-#{@count}"
    @count += 1
  end

end

doc = HexaPDF::Document.open(ARGV[0])
doc.pages.each_with_index do |page, index|
  processor = ImageBorderProcessor.new(page, index)
  page.process_contents(processor)
end

It will iterate over all pages of the input document provided on the command line and create files using your file naming scheme. Since HexaPDF doesn't currently support writing all types of PDF images, you might get some error messages for those that can't be written.

If a supported image has an associated image mask defined, it will automatically be used to create a transparent image.

The script will output all images found, even repeated ones. This could easily be changed so that just a soft link is created for repeated images.
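For readers who want the coordinate step of the script above in Python, here is a minimal sketch of what evaluating the current transformation matrix (CTM) at the corners of the unit square amounts to. It assumes the CTM is given as the six PDF matrix numbers (a, b, c, d, e, f); the function names are hypothetical and this is plain arithmetic, not the API of any PDF library.

```python
def apply_ctm(ctm, x, y):
    """Map a point through a PDF matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = ctm
    return (a * x + c * y + e, b * x + d * y + f)


def image_corners(ctm):
    """Page positions of the image corners; PDF draws an image into the unit square."""
    unit_square = ((0, 0), (1, 0), (1, 1), (0, 1))
    return [apply_ctm(ctm, x, y) for x, y in unit_square]


def bounding_box(ctm):
    """Axis-aligned box (min_x, min_y, max_x, max_y) covering all four corners."""
    xs, ys = zip(*image_corners(ctm))
    return (min(xs), min(ys), max(xs), max(ys))


# Unrotated image, 200 x 100 units, lower-left corner at (50, 300)
print(bounding_box((200, 0, 0, 100, 50, 300)))
# The same area covered by an image rotated by 90 degrees
print(bounding_box((0, 100, -200, 0, 250, 300)))
```

In the rotated case the transformed (0, 0) corner is no longer the lower-left one, which is why the minimum and maximum over all four corners are used rather than just two of them.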

K J · Accepted Answer · 2022-07-17T15:02:18.030

Python tools are likely to have the same problems as any other tool, even those that need only a single command line to manipulate images or extract their details.

Here we can see a visual layout of all the compressed images in the file, produced with a single command line that extracts the images. The individual object references have been converted into ordinary TIFF or JPEG files (other tools may use PBM and PGM, especially for OCR, but the result is generally similar). The greyscale alpha soft-mask (B&W) transparency components are not necessarily tied directly to a page or an image other than by internal references, and they usually look like negatives.

What you may note is that objects that were most likely inserted as a single PNG are split in two when they are injected into the PDF and their scaled placement is defined. Note that a raw PNG (whatever its original resolution) retains its number of dots, but its scale once inserted into the PDF can differ completely between horizontal and vertical, so the only meaningful data is W x H in pixels.

Overlaying the mask on the RGB component is not trivial once the two have simply been extracted, but keeping them separate does allow colour changes if desired.
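To illustrate only the simplest case of that overlay, here is a sketch using nothing but the Python standard library. It assumes the colour data has already been decoded to 8-bit RGB samples and the soft mask to 8-bit grey samples of the same pixel dimensions. Real PDFs often violate these assumptions (different mask size, other colour spaces, other bit depths), which is where much of the difficulty lies. The function names are hypothetical.

```python
import struct
import zlib


def merge_rgb_and_mask(rgb, mask, width, height):
    """Interleave 8-bit RGB samples with an 8-bit mask to form RGBA samples."""
    if len(rgb) != width * height * 3 or len(mask) != width * height:
        raise ValueError("sample buffers do not match the stated dimensions")
    rgba = bytearray(width * height * 4)
    rgba[0::4] = rgb[0::3]  # red
    rgba[1::4] = rgb[1::3]  # green
    rgba[2::4] = rgb[2::3]  # blue
    rgba[3::4] = mask       # mask value becomes the alpha channel
    return bytes(rgba)


def _chunk(tag, data):
    """Build one PNG chunk: length, type, data, CRC over type and data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def encode_png_rgba(rgba, width, height):
    """Encode RGBA samples as a PNG (8 bits per channel, no scanline filtering)."""
    stride = width * 4
    rows = (rgba[y * stride:(y + 1) * stride] for y in range(height))
    raw = b"".join(b"\x00" + row for row in rows)  # filter type 0 on every row
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)  # colour type 6 = RGBA
    return (b"\x89PNG\r\n\x1a\n" + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw)) + _chunk(b"IEND", b""))


# Two pixels: opaque red and fully transparent green
rgba = merge_rgb_and_mask(bytes([255, 0, 0, 0, 255, 0]), bytes([255, 0]), 2, 1)
png_bytes = encode_png_rgba(rgba, 2, 1)
```

Writing `png_bytes` to a file opened in binary mode should give a PNG that image viewers can open. In practice an imaging library would normally do the encoding; the point here is only that the mask ends up as the alpha channel.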

So PDFBox is one of the simpler and better tools for blending to a suitable output (as you have discovered), but for Python it is generally the top-end library products that can identify the placement of the two images and combine them into a suitable alpha output such as a new PNG.

For many suggestions, see "Extract images from PDF without resampling, in python?".

The related part of your question was about knowing where those components are placed on each page, since one image (and its alpha mask) can be placed multiple times, such as a heading logo on each page. Again, it is easy with a single command line to see which pages are referenced by a group of images, but seeing which image is placed where requires analysing each page's resources and therefore a library that can interrogate page contents. That is best done with powerhouse libraries such as iText, or others such as PDFTron for Python.

For a related command in PyMuPDF see https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_image_rects
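A hedged sketch of how that call might be combined with image extraction follows. It assumes a reasonably recent PyMuPDF (imported under its traditional name `fitz`) and uses only `Page.get_images`, `Page.get_image_rects` and `Pixmap`. The helper names `bbox_tag` and `extract_placed_images`, the output directory, and the filename layout are assumptions. Note that PyMuPDF reports rectangles with the origin at the top-left of the page, unlike PDF user space. The sketch has not been run against the IPCC file mentioned in the comments.

```python
from pathlib import Path


def bbox_tag(rect):
    """Format (x0, y0, x1, y1) as '<x0>_<x1>__<y0>_<y1>' using rounded integers."""
    x0, y0, x1, y1 = (round(v) for v in rect)
    return f"{x0}_{x1}__{y0}_{y1}"


def extract_placed_images(pdf_path, out_dir="images"):
    """Save every image placement as a PNG named after its page and position."""
    import fitz  # PyMuPDF (third-party); imported here so bbox_tag works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            serial = 0
            for item in page.get_images(full=True):
                xref, smask = item[0], item[1]  # image xref and soft-mask xref (0 if none)
                pix = fitz.Pixmap(doc, xref)
                if pix.alpha:
                    pix = fitz.Pixmap(pix, 0)  # drop an existing alpha channel
                if pix.n >= 4:
                    pix = fitz.Pixmap(fitz.csRGB, pix)  # e.g. CMYK to RGB before saving PNG
                if smask > 0:
                    try:
                        # attach the soft mask as alpha instead of a black background
                        pix = fitz.Pixmap(pix, fitz.Pixmap(doc, smask))
                    except Exception:
                        pass  # e.g. mismatched sizes: keep the unmasked image
                for rect in page.get_image_rects(item):  # one rect per placement
                    name = out / f"image_{page.number + 1}_{serial}_{bbox_tag(rect)}.png"
                    pix.save(str(name))
                    written.append(name)
                    serial += 1
    return written
```

Calling `extract_placed_images("fulltext.pdf")` would then write one PNG per placement, so an image reused on several pages is saved once per use. Whether the mask handling fully cures the black-background problem described in the question depends on how the particular PDF encodes its images.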
