12

I want to remove all images from a PDF file.

The page layouts should not change. All images should be replaced by empty space.

  • How can this be achieved with the help of Ghostscript and the appropriate PostScript code?
Kurt Pfeifle
  • 86,724
  • 23
  • 248
  • 345
  • So who-the-hell thought he better downvoted this question? For what reason?!? Feel free to downvote, but please give a comment and tell me why? – Kurt Pfeifle Apr 15 '15 at 18:02

2 Answers

21

Meanwhile, recent Ghostscript releases offer a much simpler way to remove all images from a PDF: add the parameter -dFILTERIMAGE to the command line.

 gs -o noimages.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf

Even better, you can also remove all text or all vector drawing elements from a PDF by specifying -dFILTERTEXT or -dFILTERVECTOR.

You can also combine these -dFILTER* parameters in any way you need to achieve the required result. (Combining all three will of course result in "empty" pages.)

Here is the screenshot from an example PDF page which contains all 3 types of content mentioned above:


Screenshot of original PDF page containing "image", "vector" and "text" elements.


Running the following 6 commands will create all 6 possible variations of remaining contents:

 gs -o noIMG.pdf   -sDEVICE=pdfwrite -dFILTERIMAGE                input.pdf
 gs -o noTXT.pdf   -sDEVICE=pdfwrite -dFILTERTEXT                 input.pdf
 gs -o noVCT.pdf   -sDEVICE=pdfwrite -dFILTERVECTOR               input.pdf

 gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf 
 gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT  input.pdf
 gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  -dFILTERTEXT  input.pdf

The following image illustrates the results:


Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.
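The six commands above follow one rule: to keep only certain content types, filter out every other type. As a minimal sketch of that rule (the helper name keep_only and its arguments text, image and vector are made up for this illustration and are not part of Ghostscript), a small shell function can derive the switches:

```shell
#!/bin/sh
# keep_only KIND...
# Print the -dFILTER* switches that remove every content type NOT listed.
# KIND is one of: text, image, vector. These names belong to this sketch
# only; Ghostscript itself only knows the -dFILTER* switches.
keep_only() {
    drop_text=1
    drop_image=1
    drop_vector=1
    for kind in "$@"; do
        case "$kind" in
            text)   drop_text=0 ;;
            image)  drop_image=0 ;;
            vector) drop_vector=0 ;;
            *) echo "unknown content type: $kind" >&2; return 1 ;;
        esac
    done
    flags=""
    if [ "$drop_text" -eq 1 ];   then flags="$flags -dFILTERTEXT";   fi
    if [ "$drop_image" -eq 1 ];  then flags="$flags -dFILTERIMAGE";  fi
    if [ "$drop_vector" -eq 1 ]; then flags="$flags -dFILTERVECTOR"; fi
    # Strip the leading space left over from the concatenation.
    echo "${flags# }"
}

# Example (the substitution is left unquoted on purpose so that each
# switch becomes a separate argument):
#   gs -o onlyTXT.pdf -sDEVICE=pdfwrite $(keep_only text) input.pdf
```

Calling keep_only with no arguments yields all three switches, which corresponds to the "empty" pages case mentioned earlier.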


Kurt Pfeifle
  • Can we remove specific vectors? If yes how to identify different vectors in the pdf itself. I tested this and it works but it also removes some vectors which I don't want. – Jay Chakra Aug 22 '16 at 11:20
  • @JayChakra: No, you cannot remove specific vectors. (You could limit the removal of all vectors to a certain page or range of pages, though, and then re-insert these pages into the original PDF document.) – Kurt Pfeifle Aug 22 '16 at 11:38
  • 1
    Your images don't seem ordered the way you entered the commands above. "Filtering" X here means not including X in the output, right? – Geremia Sep 01 '16 at 02:47
  • 2
    @Geremia: You were right about the order of the commands. I've changed it now, thank you. (At least the image capture had already held the correct descriptions.) About the names of the parameters: I agree that the *"FILTERxxx"* is not the best choice -- maybe naming them *"REMOVExxx"* would have been more user friendly. – Kurt Pfeifle Sep 01 '16 at 07:42
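The page-range workaround mentioned in the comments (filter only some pages, then put them back) could look roughly like the sketch below. It relies on Ghostscript's -dFirstPage and -dLastPage switches to split the document, filters the middle part, and merges the parts by passing them to pdfwrite in order. The function name and the file names head.pdf, filtered.pdf, tail.pdf and merged.pdf are assumptions for illustration, and the total page count has to be supplied by the caller. The function only prints the commands, so they can be inspected before running them.

```shell
#!/bin/sh
# print_filter_range_cmds INPUT FIRST LAST TOTAL FLAG
# Print (do not run) the gs commands that apply FLAG to pages FIRST..LAST
# only and then reassemble the whole document as merged.pdf.
# All output file names are hypothetical placeholders.
print_filter_range_cmds() {
    src=$1
    first=$2
    last=$3
    total=$4
    flag=$5
    parts=""
    # Pages before the range are copied without any filter.
    if [ "$first" -gt 1 ]; then
        echo "gs -o head.pdf -sDEVICE=pdfwrite -dFirstPage=1 -dLastPage=$((first - 1)) $src"
        parts="head.pdf "
    fi
    # Pages inside the range get the filter switch.
    echo "gs -o filtered.pdf -sDEVICE=pdfwrite $flag -dFirstPage=$first -dLastPage=$last $src"
    parts="${parts}filtered.pdf"
    # Pages after the range are copied without any filter.
    if [ "$last" -lt "$total" ]; then
        echo "gs -o tail.pdf -sDEVICE=pdfwrite -dFirstPage=$((last + 1)) -dLastPage=$total $src"
        parts="$parts tail.pdf"
    fi
    # Concatenate the parts in page order.
    echo "gs -o merged.pdf -sDEVICE=pdfwrite $parts"
}

# Example: remove vectors from pages 3 to 5 of a 10-page file.
#   print_filter_range_cmds input.pdf 3 5 10 -dFILTERVECTOR
# To execute the printed commands, pipe them to a shell:
#   print_filter_range_cmds input.pdf 3 5 10 -dFILTERVECTOR | sh
```

Note that every part passes through pdfwrite twice this way, so the merged file is a re-written PDF rather than the original with a few pages swapped; document-level features may not survive unchanged. File names containing spaces would need extra quoting.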
10

I'm putting up the answer myself, but the actual code is by courtesy of Chris Liddell, Ghostscript developer.

I used his original PostScript code and stripped off its other functions. Only the function which removes raster images remains. Other graphical page objects -- text sections, patterns and vector objects -- should remain untouched.

Copy the following code and save it as remove-images.ps:

%!PS

% Run as:
%
%      gs ..... -dFILTERIMAGE -dDELAYBIND -dWRITESYSTEMDICT \
%                 ..... remove-images.ps <your-input-file>
%
% derived from Chris Liddell's original 'filter-obs.ps' script
% Adapted by @pdfkungfoo (on Twitter)

currentglobal true setglobal

32 dict begin

/debugprint     { systemdict /DUMPDEBUG .knownget { {print flush} if} 
                {pop} ifelse } bind def

/pushnulldevice {
  systemdict exch .knownget not
  {
    //false
  } if

  {
    gsave
    matrix currentmatrix
    nulldevice
    setmatrix
  } if
} bind def

/popnulldevice {
  systemdict exch .knownget not
  {
    //false
  } if
  {
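    % this is hacky - some operators clear the current point,
    % i.e. 'currentpoint' may fail here; if it does, just grestore,
    % otherwise grestore and move back to the point we had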
    { currentpoint } stopped
    { grestore }
    { grestore moveto} ifelse
  } if
} bind def

/sgd {systemdict exch get def} bind def

systemdict begin

/_image /image sgd
/_imagemask /imagemask sgd
/_colorimage /colorimage sgd

/image
{
  (\nIMAGE\n) //debugprint exec
  /FILTERIMAGE //pushnulldevice exec
  _image
  /FILTERIMAGE //popnulldevice exec
} bind def

/imagemask
{
  (\nIMAGEMASK\n) //debugprint exec
  /FILTERIMAGE //pushnulldevice exec
  _imagemask
  /FILTERIMAGE //popnulldevice exec
} bind def

/colorimage
{
  (\nCOLORIMAGE\n) //debugprint exec
  /FILTERIMAGE //pushnulldevice exec
  _colorimage
  /FILTERIMAGE //popnulldevice exec
} bind def

end
end

.bindnow

setglobal

Now run this command:

gs -o no-more-images-in-sample.pdf \
   -sDEVICE=pdfwrite               \
   -dFILTERIMAGE                   \
   -dDELAYBIND                     \
   -dWRITESYSTEMDICT               \
    remove-images.ps               \
    sample.pdf

I tested the code on the PDF file of the official PDF specification, and it worked. The following two screenshots show page 750 of the input and output PDFs:

If you wonder why something that looks like an image is still on the output page: it is not really a raster image, but a 'pattern' in the original file, and therefore it is not removed.
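To see which image operators the script actually intercepted, the debugprint procedure in remove-images.ps can help: it prints a marker line (IMAGE, IMAGEMASK or COLORIMAGE) whenever the name DUMPDEBUG is defined as true in systemdict, which should be achievable with -dDUMPDEBUG in the same way -dFILTERIMAGE is picked up. The sketch below counts those marker lines in Ghostscript's standard output. The helper name count_image_ops is made up for this illustration, and the number is only a rough indicator of intercepted operator calls, not an exact count of images in the file. It assumes a Ghostscript version on which the script above still runs as described.

```shell
#!/bin/sh
# count_image_ops
# Read Ghostscript's stdout on stdin and count the marker lines that
# remove-images.ps prints when DUMPDEBUG is set. Each marker is printed
# as "\nNAME\n", so it always sits on a line of its own.
count_image_ops() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function's exit status clean in that case.
    grep -cE '^(IMAGE|IMAGEMASK|COLORIMAGE)$' || true
}

# Example:
#   gs -o no-more-images-in-sample.pdf -sDEVICE=pdfwrite \
#      -dFILTERIMAGE -dDELAYBIND -dWRITESYSTEMDICT -dDUMPDEBUG \
#      remove-images.ps sample.pdf | count_image_ops
```

A result of 0 for a page that still shows something image-like is consistent with the 'pattern' case described above, since patterns do not go through these three operators as redefined here.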

Kurt Pfeifle
  • FWIW I'm hoping to have a system level version of Chris's code built into GS in a future release. So this will be possible on all devices without additional work. Don't hold your breath though.... – KenS Apr 15 '15 at 19:35
  • @KenS: After I discovered the link to Chris' code in the IRC logs 2 hours ago I was hoping that he'd include it alongside other *.ps files in the GS `/lib/` subdir. What you let me hope for is even better :) – Kurt Pfeifle Apr 15 '15 at 19:46
  • We won't include the PostScript as such, no. I'm working on some internal stuff which will work with all the interpreters. On the down side, I've been working on it for nearly a year now. – KenS Apr 16 '15 at 07:13
  • In the command given, the reference to `remove-images.ps` is missing - it should be the second to last argument, before `sample.pdf`. – akobel May 07 '15 at 16:27
  • @perpeduumimmobile: Ha!, you are right! Thanks for spotting + reporting it. – Kurt Pfeifle May 07 '15 at 16:34
  • @KenS: is the *"system level version of Chris's code"* now in Git sources, as part of the "subclassing" stuff? – Kurt Pfeifle Jul 09 '15 at 15:57
  • Well spotted, it is indeed, committed this very afternoon. It's the 'object filtering', but be aware that it's not precisely the same as Chris' since it works in the graphics library, not the language. Though this does have the advantage that it works with all the possible input languages. – KenS Jul 09 '15 at 18:26