
My task is to bulk-convert searchable PDFs into image-only PDFs. I found that I can do this with Ghostscript by first using pdf2ps (to convert PDF to PS) and then ps2pdf (to convert PS back to PDF).

I installed gs920w32.exe on Windows 10.

pdf2ps works perfectly for me:

pdf2ps "directory\input.pdf" "directory\output.ps"

But ps2pdf simply does nothing:

ps2pdf "directory\output.ps" "directory\output.pdf"

I also noticed that if I execute pdf2ps without parameters, I get

"Usage: pdf2ps [-dASCII85DecodePages=false] [-dLanguageLevel=n] input.pdf output.ps"

But if I execute ps2pdf without parameters, I also get nothing.

What am I doing wrong?

UPD: "Image-only" PDF looks the same as "searchable" PDF but you cannot search in it, thus you can also call it just "non-searchable" PDF.

SOLUTION: I solved my problem by executing this:

gswin64c -o "directory\input.pdf" -dNoOutputFonts -sDEVICE=pdfwrite "directory\output.pdf"

2 Answers


The first thing you are doing wrong is expecting this double conversion to produce PDF files which contain only images; it won't. The relevant Ghostscript devices, ps2write and pdfwrite, go to considerable lengths to preserve vector information. About the only time you will definitely get an image in the output that wasn't in the input is when the input PDF file contains transparency, because that cannot be preserved into PostScript.

The second thing you are doing wrong is using the scripts. They are not designed for what you are trying to do. Use the Ghostscript executable instead and write any scripts you need yourself.

Since your (mad) task is to convert a PDF to an image, and then wrap that back up as a PDF, you will want to use one of the Ghostscript rendering devices to render an image for you, and then use one of the view*.ps files to read that image back and write the output through the pdfwrite device.

You won't want to use JPEG as the multiple quantisations will badly affect the image quality. I'd suggest using RAW.
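As a rough sketch of the "render, then wrap as PDF" idea, the snippet below takes a different route from the view*.ps approach described above: it assumes the installed Ghostscript build includes the pdfimage24 device, which renders each page to a bitmap and writes that bitmap into a PDF in one pass. The executable name gswin64c, the 300 dpi default, and the helper names are assumptions for illustration; running `gswin64c -h` lists the devices your build actually provides.

```python
import subprocess
from pathlib import Path


def build_raster_command(src, dst, dpi=300, gs_exe="gswin64c"):
    """Build a Ghostscript command that rasterises src into an image-only PDF.

    Assumption: the pdfimage24 device is available in the installed build.
    -o names the output file; the input file comes last.
    """
    return [
        gs_exe,
        "-o", str(dst),
        "-sDEVICE=pdfimage24",
        f"-r{dpi}",  # rendering resolution in dots per inch
        str(src),
    ]


def rasterise_pdf(src, dst, dpi=300, gs_exe="gswin64c"):
    """Run Ghostscript on one file; raises CalledProcessError on failure."""
    subprocess.run(build_raster_command(Path(src), Path(dst), dpi, gs_exe), check=True)
```

Calling `rasterise_pdf("input.pdf", "output.pdf")` would then run a single conversion. A higher dpi value gives a sharper page image at the cost of a larger file.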

KenS
  • Thanks for your response. I think you misunderstood me. An "image-only PDF" does not mean only the images from the original PDF. It is exactly the same PDF, with its original text, but you cannot search in it. I need to produce image-only PDFs in order to run OCR on them, which often fails for searchable PDFs. Here is an example of how other people do it: http://stackoverflow.com/questions/9106959/converting-searchable-pdf-to-a-non-searchable-pdf – Ekaterina Ermilova Sep 30 '16 at 12:27
  • That example uses the pswrite device, which has been removed from the Ghostscript distribution because it has been superseded by the far superior ps2write device. As I say, ps2write goes to some length to maintain the text as text, which means the result may well be 'searchable'. In any event, the scripts you are using are not intended to deal with directories of files; you would be better off writing your own shell script for that. – KenS Oct 01 '16 at 02:09

I solved my problem by executing this: gswin64c -o "directory\input.pdf" -dNoOutputFonts -sDEVICE=pdfwrite "directory\output.pdf"

  • You have input.pdf and output.pdf the wrong way round in that command line. -o specifies the output file and the input file comes last. – KenS Oct 02 '16 at 12:08
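For the bulk case, here is a minimal sketch that applies the argument order from the comment above (output after -o, input last) to every PDF in a directory. The executable name gswin64c is taken from the command in this thread; the 32-bit installer provides gswin32c instead, so adjust it as needed. The function names and directory layout are assumptions for illustration only.

```python
import subprocess
from pathlib import Path

GS_EXE = "gswin64c"  # assumption: console Ghostscript executable on PATH


def build_command(src, dst, gs_exe=GS_EXE):
    """Build the pdfwrite command line: -o is the OUTPUT, the input comes last."""
    return [
        gs_exe,
        "-o", str(dst),
        "-sDEVICE=pdfwrite",
        "-dNoOutputFonts",  # switch used in this thread's solution
        str(src),
    ]


def convert_directory(in_dir, out_dir, gs_exe=GS_EXE):
    """Convert every *.pdf in in_dir, writing same-named files into out_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        # check=True stops the batch if Ghostscript reports an error
        subprocess.run(build_command(src, dst, gs_exe), check=True)
        written.append(dst)
    return written
```

Something like `convert_directory(r"directory\in", r"directory\out")` would then process a whole folder. Writing into a separate output directory avoids overwriting the originals.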