I have PDF files created with Adobe InDesign, each containing several kinds of objects.

I want to remove everything except the largest piece of art (the background) and save the result to a new file.

This can be achieved almost perfectly with a combination of gs command options like this:

$ gs -q -o "${outfile}" -sDEVICE=pdfwrite -dFILTERTEXT -dFILTERVECTOR "${infile}"

However, occasionally a file contains several raster images, which I'd like to filter by some criteria. The image I wish to keep always has the following properties that distinguish it from the unwanted ones:

  • it is always the largest in data size
  • it is always the largest by dimension
  • it is always a 1-bit binary image, sometimes compressed with CCITT G4 or LZW
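
To make the criteria concrete, here is a minimal Python sketch of the selection rule. It assumes the image metadata has already been extracted by some other tool; the `ImageInfo` record, its field names, and the sample values are all hypothetical and only illustrate the rule, not any particular tool's output.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ImageInfo:
    """Metadata for one raster image (hypothetical field names)."""
    object_id: int           # PDF object number of the image
    width: int               # width in pixels
    height: int              # height in pixels
    bits_per_component: int  # 1 for a bilevel (binary) image
    data_size: int           # stored (compressed) size in bytes


def pick_background(images: Iterable[ImageInfo]) -> Optional[ImageInfo]:
    """Return the 1-bit image with the largest pixel area, or None.

    Ties on pixel area are broken by stored data size.
    """
    candidates = [im for im in images if im.bits_per_component == 1]
    if not candidates:
        return None
    return max(candidates, key=lambda im: (im.width * im.height, im.data_size))


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [
        ImageInfo(12, 300, 200, 8, 40_000),
        ImageInfo(15, 4960, 7016, 1, 350_000),
        ImageInfo(18, 600, 600, 1, 9_000),
    ]
    keep = pick_background(sample)
    print("keep object", keep.object_id)
```

Every image other than the one returned would then be a candidate for removal.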

Is there a way to filter images by criteria like these, using the gs command or perhaps another tool?

This link guided me to the filter options, but I could not work out how to extend the functionality further: How can I remove all images from a PDF?

The code runs in a Mac/Ubuntu environment.

taiyodayo
  • There are problems with what you are asking for. When you see an image in a PDF file, how do you know whether it's the largest, or whether a still larger image will follow later? You can't know that, so you can't use it as a criterion. You could, possibly, take a two-pass approach: first find the dimensions of the images in the file, then pass those to some device (such as the object filtering device) and have it filter out all images which don't match those dimensions. Duplicate images of the maximum dimension are still a problem, of course. – KenS Oct 01 '21 at 10:27
  • And no, this isn't something Ghostscript can do now, nor is it likely to be added in the future; it is too niche a requirement. – KenS Oct 01 '21 at 10:27
  • Thanks for the comments! Great to hear from you, the master of Ghostscript, that Ghostscript may not be the best tool for my task. I'm trying to see how best I can achieve the two-pass approach. If you have any suggestions for other tools, they would be appreciated :) – taiyodayo Oct 04 '21 at 07:44
  • You could try using MuPDF, you can script that with JS, but I'm not very familiar with it. You can certainly get the image dimensions using it. Other than that I have no idea, but there are many tools that will do 'stuff' with PDF files. It wouldn't be hard to modify the Ghostscript object filtering device to discard images based on dimensions, but you'd have to modify it by writing C code and recompiling. PS not the master of Ghostscript by any means, just the bits I'm responsible for! – KenS Oct 04 '21 at 08:29
  • Many, many thanks for the suggestions! I will try the tools suggested. I was in fact trying out writing some JS to manipulate Adobe AI, and will try reading the Ghostscript source just for the fun of it! – taiyodayo Oct 05 '21 at 09:01
  • You can ask questions about MuPDF on the #mupdf Discord channel on the Artifex Discord server (Ghostscript is on the #ghostscript channel) as well as on IRC (irc.libera.chat) in #ghostscript and #mupdf. While I have no real clue if MuPDF can do what you want, other people there know much more than me :-) – KenS Oct 05 '21 at 14:53
  • Just confirming I can use MuPDF to do exactly what I needed (deleting a specific object from a PDF and saving it out) using the JavaScript capabilities of `mutool run`! ☺ I need to work a bit more on extracting the object info and making it run automatically from that; once I do, I'll post my own answer! – taiyodayo Oct 08 '21 at 05:25
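
The first pass of the two-pass approach suggested in the comments could be sketched roughly as below. This is only an illustration under stated assumptions, not the method anyone in this thread actually used: it assumes Poppler's `pdfimages -list` is installed and prints one row per image with the 16 columns named in `COLUMNS` (the layout may differ between versions), and the function names are made up for this sketch. The second pass (actually deleting the objects, for example with `mutool run`) is not shown.

```python
import subprocess
from typing import Dict, List

# Column layout assumed for `pdfimages -list`. The header prints "object ID"
# as two words; they are treated here as object number and generation.
COLUMNS = ["page", "num", "type", "width", "height", "color", "comp", "bpc",
           "enc", "interp", "object", "gen", "x_ppi", "y_ppi", "size", "ratio"]
INT_COLUMNS = ["page", "num", "width", "height", "comp", "bpc", "object", "gen"]


def list_images(pdf_path: str) -> str:
    """Run `pdfimages -list` and return its text output."""
    result = subprocess.run(["pdfimages", "-list", pdf_path],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_image_list(listing: str) -> List[Dict[str, object]]:
    """Turn the listing into one dict per image row."""
    records = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header, the dashed rule, blank lines and odd rows.
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        rec: Dict[str, object] = dict(zip(COLUMNS, fields))
        try:
            for name in INT_COLUMNS:
                rec[name] = int(fields[COLUMNS.index(name)])
        except ValueError:
            continue  # e.g. a row without a numeric object number
        records.append(rec)
    return records


def objects_to_drop(records: List[Dict[str, object]]) -> List[int]:
    """Object numbers of every image except the one with the largest area."""
    if not records:
        return []
    keep = max(records, key=lambda r: r["width"] * r["height"])
    return sorted({r["object"] for r in records if r["object"] != keep["object"]})
```

Calling `objects_to_drop(parse_image_list(list_images(infile)))` would give the object numbers to hand to the second pass. Note that this simple rule keeps only one image; distinct images that share the maximum dimensions, the case raised in the comments, would need extra handling.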
