8

I want to remove all the images from a PDF, leaving only the text / fonts, using whatever command-line tool can do it.

I tried using -dGraphicsAlphaBits=1 in a Ghostscript command, but the images are still present, only rendered as large pixel blocks.

Kurt Pfeifle
hussainb

5 Answers

21

You can use the draft option of cpdf:

cpdf -draft in.pdf -o out.pdf

This should work in most situations, but file a bug report if it doesn't do the right thing for you.

Disclosure: I am the author of cpdf.
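
For more than one file, the same invocation can be wrapped in a loop. What follows is only a sketch built on the `cpdf -draft in.pdf -o out.pdf` form shown above; the function name `strip_images_batch` and the `DRY_RUN` variable are made up for this example and are not part of cpdf.

```shell
#!/bin/sh
# Hypothetical helper: run `cpdf -draft` on every PDF in SRC_DIR and write
# the results, under the same file names, into OUT_DIR.
# OUT_DIR should differ from SRC_DIR so that no input file is overwritten.
# With DRY_RUN set to a non-empty value the commands are printed, not run.
strip_images_batch() {
  src_dir=$1
  out_dir=$2
  run=${DRY_RUN:+echo}
  # Only create the output directory when the commands are really executed.
  [ -n "$DRY_RUN" ] || mkdir -p "$out_dir" || return
  for pdf in "$src_dir"/*.pdf; do
    [ -e "$pdf" ] || continue   # no match: the glob pattern stays literal
    $run cpdf -draft "$pdf" -o "$out_dir/$(basename "$pdf")" || return
  done
}
```

Calling `DRY_RUN=1 strip_images_batch ./pdfs ./pdfs-noimages` first shows what would be executed; dropping `DRY_RUN=1` runs it for real.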

Cody Piersall
johnwhitington
  • Thanks, this works quite well. It successfully removes all the images from the PDF. Next I tried to remove fonts from the PDF using the command `cpdf -remove-fonts in.pdf -o out.pdf`, but it leaves corrupted fonts / black blobs. Will look into that. – hussainb Dec 20 '13 at 17:11
  • I tried the same technique, and all of the images did get removed, but the text is selectable but not visible. Any idea how to deal with that? – whereswalden Mar 19 '15 at 00:39
18

Time has passed, and development of Ghostscript has progressed...

The latest releases have the following new command line parameters. These can be added to the command line:

  1. -dFILTERIMAGE: produces an output where all raster drawings are removed.

  2. -dFILTERTEXT: produces an output where all text elements are removed.

  3. -dFILTERVECTOR: produces an output where all vector drawings are removed.

Any two of these options can be combined.

Example command:

gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
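
Since the filters can be combined, it may be easier to think in terms of what should be kept rather than what should be dropped. The following is a small sketch under that assumption: the function name `gs_keep` and the `DRY_RUN` variable are invented for this example; only the three `-dFILTER...` switches and the `gs -o ... -sDEVICE=pdfwrite` form come from the answer above.

```shell
#!/bin/sh
# Hypothetical wrapper: keep only the listed content types (text, image,
# vector) and filter out everything else.
# Usage: gs_keep INPUT OUTPUT [text] [image] [vector]
# With DRY_RUN set to a non-empty value the command is printed, not run.
gs_keep() {
  input=$1
  output=$2
  shift 2
  # Start by dropping everything, then clear the switch for each kept type.
  drop_text=-dFILTERTEXT
  drop_image=-dFILTERIMAGE
  drop_vector=-dFILTERVECTOR
  for keep in "$@"; do
    case $keep in
      text)   drop_text= ;;
      image)  drop_image= ;;
      vector) drop_vector= ;;
      *) echo "unknown content type: $keep" >&2; return 2 ;;
    esac
  done
  run=${DRY_RUN:+echo}
  # The drop_* variables are unquoted on purpose: empty ones disappear.
  $run gs -o "$output" -sDEVICE=pdfwrite \
    $drop_text $drop_image $drop_vector "$input"
}
```

For example, `gs_keep input.pdf onlyText.pdf text` filters images and vectors, while `gs_keep input.pdf noImages.pdf text vector` is equivalent to the example command above.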

More details (including some illustrative screenshots) can be found in my answer to "How can I remove all images from a PDF?".

Kurt Pfeifle
4

No, AFAIK, it's not possible to remove all images in a PDF with a commandline tool.

What's the purpose of your request anyway? Save on filesize? Remove information contained in images? Or ...?

Workaround

Whatever you are aiming at, here is a command that will downsample all images to a resolution of 2 ppi (update: 1 ppi doesn't work), which achieves two goals at once:

  • reduce the file size
  • make all images basically incomprehensible

Here's how to do it selectively, for only the images on page 33 of original.pdf:

gs                               \
  -o images-uncomprehendable.pdf \
  -sDEVICE=pdfwrite              \
  -dDownsampleColorImages=true   \
  -dDownsampleGrayImages=true    \
  -dDownsampleMonoImages=true    \
  -dColorImageResolution=2       \
  -dGrayImageResolution=2        \
  -dMonoImageResolution=2        \
  -dFirstPage=33                 \
  -dLastPage=33                  \
   original.pdf

If you want to do it for all images on all pages, just skip the -dFirstPage and -dLastPage parameters.
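
The command above can also be wrapped so that the resolution and the page range become optional arguments. This is just a sketch: the function name `downsample_images` and the `DRY_RUN` variable are made up here, and the lower bound of 2 ppi only reflects the observation in this answer that 1 ppi did not work.

```shell
#!/bin/sh
# Hypothetical wrapper around the downsampling command above.
# Usage: downsample_images INPUT OUTPUT [PPI [FIRST_PAGE [LAST_PAGE]]]
# PPI defaults to 2; without page arguments all pages are processed.
# With DRY_RUN set to a non-empty value the command is printed, not run.
downsample_images() {
  input=$1
  output=$2
  ppi=${3:-2}
  first=$4
  last=$5
  # Reject non-numeric values and anything below 2 ppi.
  if ! [ "$ppi" -ge 2 ] 2>/dev/null; then
    echo "resolution must be an integer of at least 2 ppi" >&2
    return 2
  fi
  run=${DRY_RUN:+echo}
  # The page switches are only added when the arguments were given.
  $run gs -o "$output" -sDEVICE=pdfwrite \
    -dDownsampleColorImages=true \
    -dDownsampleGrayImages=true \
    -dDownsampleMonoImages=true \
    -dColorImageResolution="$ppi" \
    -dGrayImageResolution="$ppi" \
    -dMonoImageResolution="$ppi" \
    ${first:+-dFirstPage="$first"} ${last:+-dLastPage="$last"} \
    "$input"
}
```

`downsample_images original.pdf out.pdf 2 33 33` reproduces the page-33 example, and `downsample_images original.pdf out.pdf` processes the whole document.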

If you want to remove all color information from images, convert them to grayscale in the same command (search other answers on Stack Overflow where the details are discussed).


Update: Originally, I had proposed to use a resolution of 1 PPI. It seems this doesn't work with Ghostscript. I now tested with 2 PPI. This works.


Update 2: See also the following (new) question with the answer:

It provides some sample PostScript code which completely removes all (raster) images from the PDF, leaving the rest of the page layout unchanged.

It also reflects the expanded new capabilities of Ghostscript which can now selectively remove either all text, or all raster images, or all vector objects from a PDF, or any combination of these 3 types.

Kurt Pfeifle
  • Thanks a lot @Kurt, I really wanted you to answer my question as it seems you are the only expert into processing pdfs. Actually my final aim is to generate two images, one containing the image layer and the other image containing only the text layer. Removing the background is just an effort for the final aim. – hussainb Dec 19 '13 at 16:43
  • But it actually is possible via a commandline tool, e. g. via cpdf. And there may be plenty of reasons why it is done - for example I can give you my reason, which is why I searched for this - I need to prepare for an exam but the image files are useless after knowing them already, so I just focus on the text first; and then as the second step, from that text, make notes what is worthy to be memorized and what is not. I can also think of many more possible reasons but I think on stackoverflow it is best to not ask WHY but to simply provide a solution that works. – shevy Mar 11 '17 at 23:56
  • @shevy: Please take note of the following facts: (1) The OP asked specifically for a Ghostscript or an ImageMagick solution. (2) My answer provided exactly what was asked for. I did provide it *after* John Whitington's pointing to `cpdf` (his own self-made tool, which is excellent!), because cpdf is not universally available as is Ghostscript (3) `cpdf` is a payware tool. Even though there is a free-of-charge version ("community edition"), this one is only legal to use for non-commercial purposes. (4) I did not ask for *your* reason -- I asked the OP, because it may be useful to know in... – Kurt Pfeifle Mar 12 '17 at 12:44
  • @shevy: (/continued) ...in order to shape the answer accordingly. For example, if the main purpose of this question was to minimize file size then there may be other (additional) methods than just to remove images... (5) *"StackOverflow [...] is best to [...] simply provide a solution that works"*. Thanks for the hint, mate. I would never have thought of that. Looking forward for all YOUR solutions that work! (6) And thanks for your downvote, anyway. – Kurt Pfeifle Mar 12 '17 at 12:48
4
gs -o noImages.pdf    -sDEVICE=pdfwrite -dFILTERIMAGE                input.pdf
gs -o noText.pdf      -sDEVICE=pdfwrite -dFILTERTEXT                 input.pdf
gs -o noVectors.pdf   -sDEVICE=pdfwrite -dFILTERVECTOR               input.pdf
gs -o onlyImages.pdf  -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT  input.pdf
gs -o onlyText.pdf    -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
gs -o onlyVectors.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  -dFILTERTEXT  input.pdf
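
The asker mentioned in a comment that the final aim is two images per page, one with only the text and one with only the raster images. A possible sketch for that, assuming the same filter switches are also honoured by Ghostscript's raster devices (here `png16m`), could look as follows. The function name `render_layers`, the output naming scheme and the `DRY_RUN` variable are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: render each page twice, once with only the text and
# once with only the raster images, as PNG files.
# Usage: render_layers INPUT PREFIX [DPI]
# Output files: PREFIX-text-001.png, PREFIX-images-001.png, ...
# With DRY_RUN set to a non-empty value the commands are printed, not run.
render_layers() {
  input=$1
  prefix=$2
  dpi=${3:-150}
  run=${DRY_RUN:+echo}
  # Text only: filter out raster images and vector drawings.
  $run gs -o "${prefix}-text-%03d.png" -sDEVICE=png16m -r"$dpi" \
    -dFILTERIMAGE -dFILTERVECTOR "$input" || return
  # Raster images only: filter out text and vector drawings.
  $run gs -o "${prefix}-images-%03d.png" -sDEVICE=png16m -r"$dpi" \
    -dFILTERTEXT -dFILTERVECTOR "$input"
}
```

`render_layers input.pdf page` would then write `page-text-001.png`, `page-images-001.png` and so on, one pair per page, at 150 dpi.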
kiran beethoju
2

Unfortunately, there is no Free/Open Source Software utility available that separates images and text into different layers. Nor is there a free-as-in-beer one...

This task can only be achieved with various payware software solutions. Since you didn't exclude this in your question, but you asked for 'whatever commandline tool possible', I'll tell you my favorite one:

A version for CLI usage (which includes a powerful SDK enabling lots of low-level PDF manipulations) is available, and this is supported on all major OS platforms, including Linux.

callas offers you a fully featured gratis test license which is enabled for (I believe) 14 days.

Kurt Pfeifle
  • I too understand that it may not be possible to find an easy way. But I had partial success generating a background-only image using ImageMagick: I just used "-blur 0x0" and it generated a background-only image. I understand that it's not the proper way and that results may vary between PDFs. I'm just trying to see if I can reverse the effect so that the text remains next time. I will definitely try out 'callas'; it's trial-ware for 7 days. I might end up buying it if it works as expected, and if it isn't too heavy on the pocket. – hussainb Dec 20 '13 at 07:16
  • ImageMagick processes raster images only. In so far as it takes PDF as input... no, it doesn't take PDF itself, it calls Ghostscript as its *delegate* to convert the pages to a series of images first; for outputting PDF it again wraps the raster image into a thin PDF shell. Once data passes through ImageMagick, you only have raster data left. Just like after you turn a steak into minced meat: there is no way back to the original steak any more. I can tell you for sure that there is no way to employ ImageMagick to separate text and images occurring on the same PDF page into separate layers... – Kurt Pfeifle Dec 20 '13 at 11:09
  • @codin: I'm now not so sure if we have the same understanding of 'layers' for a PDF. In the PDF specification, layers are also named *Optional Content Groups* (OCG). Do you mean this? – Kurt Pfeifle Dec 20 '13 at 11:11
  • @codin: Can you supply a sample PDF (with one or only a *few* pages) where you want to separate images and text into different layers? – Kurt Pfeifle Dec 20 '13 at 11:13
  • @codin: I seriously doubt that ImageMagick's `-blur 0x0` will turn a mixed text/image PDF page into a file where you only see pixels from the image, and none from text.... – Kurt Pfeifle Dec 20 '13 at 11:14
  • Just finished installing GS and ImageMagick on my home PC. I tried `convert -blur 0x0 in.pdf out.png` on the same PDF, but it doesn't produce the image-only output here. Looks like a bug on my work PC. – hussainb Dec 20 '13 at 16:56
  • @codin: Even if it works, it will not get you anywhere, because you'll have image **and** text in one output file. Having said that, depending on your version of IM, you may need to use a different order of the command line arguments: `convert in.pdf -blur 0x0 out.png`. – Kurt Pfeifle Dec 20 '13 at 22:08