In "How can I remove all images from a PDF?", Kurt Pfeifle gave a piece of PostScript code (courtesy of Chris Liddell) that filters all bitmaps out of a PDF using Ghostscript.

This works like a charm; however, I'm also interested in the companion task of removing everything except bitmaps from the PDF, and without recompressing bitmaps. Or, eventually, separating the vector and bitmap "layers." (I know, this is not what a layer is in PDF terminology.)

AFAIU, Kurt's filter works by sending all bitmaps to a null device, while leaving everything else to pdfwrite. I read that it is possible to use different devices with GS, so my hope is that it is possible to send everything to a fake/null device by default, and only switch to pdfwrite for those images which are captured by the filter. But unfortunately I'm completely unable to translate such a thing into PostScript code.

Can anyone help, or at least tell me if this approach might be doomed to fail?

akobel
  • Can't help with a ghostscript solution if that is what you are looking for, but I wanted to make you aware that there are very elegant PDF based solutions if you can use commercial tools. If you're interested in that too I can explain more. – David van Driessche May 07 '15 at 23:28
  • Thanks David. Indeed I'm looking for at least a free-as-in-beer-for-personal-use tool; not necessarily libre, though. So something like [CoherentPDF](http://community.coherentpdf.com/) in the community release would be fine (btw, it does the opposite direction quite nicely with the `-draft` option). But the closer to things already bundled in the main Linux distros the better, and Linux support is required. – akobel May 08 '15 at 07:12

1 Answer

It's possible, but it's a large amount of work.

You can't start with the nulldevice and push the pdfwrite device as needed. That simply won't work, because the pdfwrite device will write out the accumulated PDF file as soon as you unload it. Reloading it will start a new PDF file.

Also, you need the same instance of the pdfwrite device for all the code, so you can't load the pdfwrite device, load the nulldevice, then load the pdfwrite device again only for the bits you want. That means the only approach which (currently) works is the one Chris wrote: load pdfwrite and push the null device into place whenever you want to silently consume an operation.

Filtering just images is a fairly limited change, because there aren't that many operators which deal with images.

In order to remove everything except images, however, there are a lot of operators to override: stroke, fill, eofill, rectstroke, rectfill, ustroke, ufill, ueofill, shfill, show, ashow, widthshow, awidthshow, xshow, xyshow, yshow, glyphshow, cshow and kshow. I might have missed a few operators, but those are the basics at least.

Note that the code Chris originally posted actually filtered various types of objects, not just images. You can find his code here:

http://www.ghostscript.com/~chrisl/filter-obs.ps

Please be aware this is unsupported example code only.
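
As a rough sketch of how two passes with that file might look: the `-dFILTER...` switches are the ones reported to work in the comments below, while `gs_filter_cmd` is a made-up helper and `input.pdf`, `vector-only.pdf` and `images-only.pdf` are placeholder names. It assumes filter-obs.ps has been saved to the current directory, and it only prints the command lines rather than running them. Check the comments at the top of filter-obs.ps for any further switches it may expect.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the Ghostscript command for one pass.
# $1 = output PDF, $2 = input PDF, remaining arguments = filter switches.
gs_filter_cmd() {
    dst=$1
    src=$2
    shift 2
    # filter-obs.ps is listed before the input file because Ghostscript
    # processes the files on its command line in order.
    echo "gs -o $dst -sDEVICE=pdfwrite $* filter-obs.ps $src"
}

# Pass 1: drop images, keep vector art and text.
gs_filter_cmd vector-only.pdf input.pdf -dFILTERIMAGE

# Pass 2: drop fills, strokes and text, keep images.
gs_filter_cmd images-only.pdf input.pdf -dFILTERFILL -dFILTERSTROKE -dFILTERSHOW
```

To actually execute a pass, pipe the printed line to `sh` or paste it into a terminal.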

KenS
  • That's great, thanks a lot. So indeed it looks like one needs two passes to separate bitmaps and vector stuff. I could not find the original code by Chris; thanks for that link. Using his filter-obs.ps with either `-dFILTERIMAGE` or `-dFILTERFILL -dFILTERSTROKE -dFILTERSHOW` achieves almost exactly what I need. -- Almost, because now I'm lacking a way to tell GS to not reencode/recompress the bitmaps. But that'll be the point of another question, I guess... – akobel May 08 '15 at 07:03
  • If you want the different objects in different files, then yes, you will need two (or more) passes. As regards compression, it's in the documentation... – KenS May 08 '15 at 07:49
  • The two passes are no problem. With respect to the compression, I could not find anything. IIUC, GS _always_ interprets the images, and a simple passthrough seems not to be available (e.g. [Kurt's info here](http://superuser.com/questions/360216/use-ghostscript-but-tell-it-to-not-reprocess-images#answer-373740)). At least the jbig2 -> ccitt-conversion is not too bad, but also JPEGs are recompressed. `-dAutoFilterColorImages=false` did not help there, either... – akobel May 08 '15 at 11:43
  • [standard lecture, again...] GS always interprets the input fully, and converts that into a series of marking operations. These operations are then fed to a 'device' which deals with them. In the case of a rendering device (e.g. TIFF) the marks are rendered to a bitmap. In the case of a high level device, they are emitted as a high level operation in the marking language. There are advantages and disadvantages to this. An advantage is that you can pull the kind of trickery you are taking advantage of to elide objects. A disadvantage is that the output is not the same as the input. – KenS May 08 '15 at 13:01
  • You can't have it both ways :-) The pdfwrite device does not know anything about images as they are presented to the input, and therefore it does not know that they were JPEG compressed. So if you don't want it to apply JPEG compression then you need to tell it not to. Simply telling it not to auto-filter the images won't help, because the default is JPEG. You need to set ColorImageFilter and also potentially the Gray and Mono filters. – KenS May 08 '15 at 13:03
  • Thanks for the confirmation, it's what I expected. And yes, I'm thankful for what GS does... ;-) – akobel May 19 '15 at 09:23
  • I believe I still have an open feature request to not decompress JPEG images, but it needs work at both the interpreter front end, and PDF output back end, and I haven't found the time to even look at what's involved yet. – KenS May 19 '15 at 12:44
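
For the compression settings KenS describes, a minimal sketch might look like the following. It assumes lossless Flate compression is an acceptable replacement for the default JPEG filter, so images are still decoded and re-encoded but not degraded further, and file size may grow. `lossless_image_flags` is a made-up helper and the file names are placeholders. Mono images have their own `MonoImageFilter` setting, which is left untouched here.

```shell
#!/bin/sh
# Hypothetical helper: print the pdfwrite switches that turn off automatic
# filter selection and choose Flate (lossless) for colour and gray images.
lossless_image_flags() {
    printf '%s %s %s %s\n' \
        -dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode \
        -dAutoFilterGrayImages=false -dGrayImageFilter=/FlateEncode
}

# Print (do not run) a complete command line using those switches.
echo "gs -o output.pdf -sDEVICE=pdfwrite $(lossless_image_flags) input.pdf"
```

The same switches can be added to a filtering pass that keeps only the images.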