33

Is there any easy (scriptable) way to convert a PDF with vector images into a PDF with raster images? In other words, I want to generate a PDF with the exact same (un-rasterized) text but with each vector image replaced with a rasterized version.

I occasionally read PDFs of technical articles on my Kindle, and have found that reading a PDF directly is frustrating. Thankfully, Amazon's automatic conversion of PDFs to the Kindle format does a good job of reflowing the text portions of most of the PDFs I have tried. However, while raster images seem to make it through the conversion process fine, vector images get horribly mangled. It would be great if I could easily convert a PDF so that all of its vector images were rasterized.

I am interested in any possible solutions, but a Linux- or Windows-based one would be preferable.

Michael Boyer
  • Note: this question was originally [posted](http://tex.stackexchange.com/questions/47076/replacing-vector-images-in-a-pdf-with-raster-images) at the [TeX site](http://tex.stackexchange.com/), but the mods there suggested I ask it here instead. – Michael Boyer Mar 07 '12 at 19:22
  • You can export all pages to images and then create a PDF using those images. There are lots of applications that can do this. I think a combination of ImageMagick and/or Ghostscript would do. For programmers, I have written an article titled "How To Rasterize A PDF Document In .NET," which shows how to do this using our PDFOne .NET product. – BZ1 Mar 08 '12 at 05:00
  • 1
    But I only want to rasterize the images/figures in the PDF, not the text. I don't see any way to do this using ImageMagick. I'll take a look at Ghostscript. – Michael Boyer Mar 08 '12 at 16:18
  • 1
    @MichaelBoyer Unless you are asking for a solution for a given framework (e.g. .NET, JAVA, Windows, Linux), this question seems more suitable for the SuperUser site than for StackOverflow. – Danny Varod Feb 03 '13 at 16:43

9 Answers

19

I had a similar issue and solved it using ImageMagick's convert tool (http://www.imagemagick.org/script/index.php). It comes with Linux and runs fine on Windows/Cygwin or OS X:

convert -density 300 largeVectorFileFromR.pdf out.pdf

With -density 300 you control resolution (as DPI).

Downside: text is rasterized as well, which I understand Michael does not want.
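To get a feel for what a given -density value means for the output, here is a small sketch. The function name raster_px is made up for illustration. PDF page sizes are measured in points (72 per inch), so a page side of N points rasterized at D DPI comes out at roughly N × D / 72 pixels.

```shell
#!/bin/sh
# Hypothetical helper: pixel length of a page side given in PDF points
# (1 pt = 1/72 inch) when rasterized at a given DPI.
# Uses integer arithmetic, so the result is rounded down.
raster_px() {
    points=$1
    dpi=$2
    echo $(( points * dpi / 72 ))
}

# A US Letter page is 612 x 792 pt:
#   raster_px 612 300    # width in pixels at 300 DPI
#   raster_px 792 300    # height in pixels at 300 DPI
```

Doubling the density quadruples the pixel count, so file size and conversion time grow quickly at high DPI values.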

vertikalist
  • 4
    Users encountering a `no images defined` error will need to install the required ghostscript `gs` dependency. For MacOS users with Homebrew: `brew install ghostscript` – Mark Egge Sep 19 '17 at 19:21
  • This is the solution that works for me. I also need to have ghostscript on Windows. – Mu-Tsun Tsai Apr 11 '22 at 02:17
13

After some days of searching for a solution, and based on "Remove all text from PDF file" and "How to add a picture onto an existing pdf file?", I found an (ugly) scriptable solution:

gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE "$INPUT_FILE" && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfwrite -dFILTERTEXT "$INPUT_FILE" && \
convert -density "$DPI" -quality 100 /tmp/graphics.pdf /tmp/graphics.png && \
convert -density "$DPI" -quality 100 /tmp/graphics.png /tmp/graphics.pdf && \
pdftk /tmp/graphics.pdf stamp /tmp/onlytxt.pdf output "$OUTPUT_FILE" && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf /tmp/graphics.png

where we have three variables: INPUT_FILE, OUTPUT_FILE, and DPI. We split the textual and graphical content via Ghostscript, convert the graphical content to a raster image (PNG), and join the two using pdftk.

I've been using this successfully to convert huge vector images for use in scientific papers.
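The same pipeline can be wrapped in a function that uses a private temporary directory instead of fixed names under /tmp. This is only a sketch under some assumptions: gs, convert, and pdftk are on the PATH, the input has a single page, and the names rasterize_figures, run, and DRY_RUN, as well as the 300 DPI default, are made up for illustration.

```shell
#!/bin/sh
# Sketch of the split / rasterize / stamp pipeline with a private temp dir.
# Assumes gs, convert and pdftk are installed and the input is single-page.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rasterize_figures() {
    input=$1 output=$2 dpi=${3:-300}   # 300 DPI default is an assumption
    tmp=$(mktemp -d) || return 1
    # 1) text only, 2) graphics only, 3-4) rasterize graphics via PNG,
    # 5) stamp the text layer on top of the rasterized graphics
    run gs -o "$tmp/onlytxt.pdf" -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE "$input" &&
    run gs -o "$tmp/graphics.pdf" -sDEVICE=pdfwrite -dFILTERTEXT "$input" &&
    run convert -density "$dpi" -quality 100 "$tmp/graphics.pdf" "$tmp/graphics.png" &&
    run convert -density "$dpi" -quality 100 "$tmp/graphics.png" "$tmp/raster.pdf" &&
    run pdftk "$tmp/raster.pdf" stamp "$tmp/onlytxt.pdf" output "$output"
    status=$?
    rm -rf "$tmp"   # clean up whether or not the pipeline succeeded
    return $status
}
```

Calling rasterize_figures paper.pdf paper-raster.pdf 150 would run the five steps in order; with DRY_RUN=1 set beforehand it only prints them, which is handy for checking the command lines first.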

Civ Lins
  • For more recent versions of ImageMagick, such operations on PDF are forbidden by default so ``/etc/ImageMagick-7/policy.xml`` must be edited (see [here](https://bugs.archlinux.org/task/60580)) – Silmathoron Apr 15 '19 at 22:50
  • 1
    also for people who might want to do this for a multipage pdf, ``pdftk`` offers an equivalent ``multistamp`` option – Silmathoron Apr 16 '19 at 13:23
2

Pitstop Pro v2 update 3 from Enfocus can do exactly that. It has an action called "Rasterize page content, keeping text" which works pretty well. It is a plugin to Adobe Acrobat so it requires a little more but is also available as a server solution.

fltman
  • 3
    Welcome to Stack Overflow. The above post might answer the question, but a little more explanation might help fellow programmers understand how it works. – Nagama Inamdar Nov 14 '14 at 10:41
1

It's a little complicated, but you asked for any possible solution. Furthermore this solution is not automatable.

1) Open the PDF with the vector images in Inkscape, then select the whole image with the select tool (F1).

2) If the vector image consists of more than one SVG object, press Ctrl + G (Object --> Group).

3) Cut the grouped SVG image with Ctrl + X.

4) Open a new Inkscape window with Ctrl + N and paste the image with Ctrl + V.

5) Choose File --> Export Bitmap (Shift + Ctrl + E); you may want to increase the DPI.

6) Go back to the first Inkscape window, choose File --> Import (Ctrl + I), and select the previously exported bitmap.

7) Move the bitmap to the location where the SVG image was.

Save the PDF, and the vector image is replaced by a bitmap image.

Martin Grohmann
  • 1
    Very complicated and work intensive. I am looking for a more automated version and thought that such a script should exist somewhere. – data Feb 07 '13 at 16:42
  • Yes I figured that you need a scriptable way. But I thought after 11 months without a single answer, I share a possible way, at least. – Martin Grohmann Feb 07 '13 at 20:36
  • I understand the OP wanted something automatable, but thanks for sharing this answer - I found it a useful suggestion for a case where there is just one problematic image – Mike M Dec 30 '22 at 16:13
1

Here's one way to solve your problem:

Step 1: Use an online PDF-to-HTML converter, like the one here:

http://www.idrsolutions.com/online-pdf-to-html5-converter/

This tool converts the PDF into a set of images and a text overlay. The vector images should be converted to raster at this point.

Step 2: Convert the HTML+images back into PDF:

http://pdfcrowd.com/#convert_by_upload+with_options

The resulting PDF will have all the vector images rasterized, and all text will remain text, so you can select, copy, etc.

Hari
  • Problem for me is that for many pdfs, pdf2html is not able to parse the pdf correctly, thus making this inefficient. – data Feb 07 '13 at 16:41
  • Another problem is that text _within_ figures should be rasterized along with the rest of the figures; for example, think of the labels on the axes of a graph. This solution (pdf2html) leaves that text as text, so the resulting rasterized figure is incomplete. – Michael Boyer Feb 07 '13 at 18:01
  • Also, it is unclear how you would use this for a PDF with more than one page. – Michael Boyer Feb 07 '13 at 18:01
  • pdf2html is based on xpdf, so it's less capable than some of the more recent PDF libraries. I'd encourage you to download (or try the online version of) the JPedal PDF-to-HTML converter linked to in the answer. It allows the generation of a single HTML file for multiple pages. Also, could you attach a sample PDF to the question? I work with PDF a fair bit and might be able to come up with something. (No affiliation with the sites linked to above.) – Hari Feb 08 '13 at 04:07
1

Convert the pdf to djvu with https://jwilk.net/software/pdf2djvu converter. Uncheck "antialias fonts,vectors..". It will reduce file size significantly and improve document load times.

user260396
0

I used the following:

gswin32c -o "%2" -dFirstPage=1 -dLastPage=1 -sDEVICE=pngalpha -r72x72 -dUseCropBox -dFitPage "%1" -dBATCH -dNOPAUSE

where %1 is the input file and %2 is the output. This can be used with LaTeX: the generated PNG has the same aspect ratio and page size as the original PDF, so the relative position of the image will not change.

Note that in Linux, you may need to use gs rather than gswin32c.

You can also set the page range and then print the pages back to PDF. The downside is that the text gets rasterized as well.
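As a rough sketch of the page-range variant, the function below renders a range of pages to numbered PNGs with Ghostscript and bundles them back into one PDF with ImageMagick's convert. It assumes gs and convert are on the PATH; the names rasterize_pages, run, and DRY_RUN are made up for illustration, and using convert for the "print back to PDF" step is my choice, not something the answer specifies.

```shell
#!/bin/sh
# Sketch: rasterize a page range with Ghostscript, then bundle the PNGs
# back into a single PDF with ImageMagick. Text is rasterized too.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rasterize_pages() {
    input=$1 output=$2 first=$3 last=$4 dpi=${5:-72}
    tmp=$(mktemp -d) || return 1
    # %03d in the output name makes gs write one numbered PNG per page
    run gs -o "$tmp/page_%03d.png" -sDEVICE=pngalpha -r"$dpi" \
        -dFirstPage="$first" -dLastPage="$last" -dUseCropBox "$input" &&
    # the glob expands in numeric order thanks to the zero-padded names
    run convert "$tmp"/page_*.png "$output"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

For example, rasterize_pages article.pdf article-raster.pdf 2 4 150 would rasterize pages 2 to 4 at 150 DPI. On Windows the gs binary would be gswin32c, as in the command above.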

the swine
0

Inkscape is the best solution. I quickly made this rather unoptimized batch file that does exactly that; you can play with it and change options. ImageMagick convert, gs, or pdftoimages don't work as well as Inkscape: they either don't export the layers, or export them with bad quality:

#!/bin/bash
#set -xev
# $1 is a zip archive that contains combined_to_do.pdf
ORIGINAL_FOLDER=$(pwd)
JPEGS=$(mktemp -d)
unzip "$1" -d "$JPEGS"
cd "$JPEGS"
# Expand the PDF into single-page PDFs
pdftk combined_to_do.pdf burst output pg_%04d.pdf
#1) Render the PDFs to PNGs as they are seen, with alpha, layers, transparency, etc.; this cannot be done by ImageMagick convert or pdftoimages
ls ./pg*.pdf | xargs -L1 -I {}  inkscape {} -z --export-dpi=300 --export-area-drawing --export-png={}.png
#2) Second, convert the PNGs to JPEGs
rm *.pdf
ls ./p*.png | xargs -L1 -I {} convert {}  -quality 100 -density 300  {}.jpg
#3) Make a PDF file out of every JPEG image without loss of either resolution or quality:
ls -1 ./*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf
#4) Concatenate the PDF pages into one:
pdftk *.jpg.pdf cat output combined.pdf
#5) Last, add an OCRed text layer that doesn't change the quality of the scan, so the PDFs are searchable:
pypdfocr combined.pdf
cp "$JPEGS/combined_ocr.pdf" "$ORIGINAL_FOLDER/${1}_ocr.pdf"
cp "$JPEGS/combined.pdf" "$ORIGINAL_FOLDER/${1}.pdf"

Eduard Florinescu
0

Based on Civ Lins's solution, I came up with this:

#!/usr/bin/env sh
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE "$1" && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfimage24 -dFILTERTEXT -r600 -dDownScaleFactor=6 "$1" && \
pdftk /tmp/graphics.pdf multistamp /tmp/onlytxt.pdf output "$2" && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf

(In contrast to the previous solution, it handles multipage PDFs and uses gs to directly render the rasterized image without the detour of convert.)
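As far as I understand the Ghostscript options, -r600 -dDownScaleFactor=6 renders each page internally at 600 DPI and then scales it down by an integer factor of 6, so the embedded image ends up at 600 / 6 = 100 DPI with smoother edges. A small sketch for picking the pair of options from a target output DPI follows; the function names render_dpi and downscale_opts are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for choosing -r and -dDownScaleFactor together.

# Internal render resolution needed so that downscaling by <factor>
# yields <target> DPI in the output.
render_dpi() {
    target=$1
    factor=$2
    echo $(( target * factor ))
}

# Prints the two gs options for a given target DPI and downscale factor.
downscale_opts() {
    echo "-r$(render_dpi "$1" "$2") -dDownScaleFactor=$2"
}

# Example (left unquoted on purpose so it splits into two options):
#   gs -o graphics.pdf -sDEVICE=pdfimage24 -dFILTERTEXT $(downscale_opts 150 4) input.pdf
```

A higher target DPI gives sharper figures at the cost of a larger file, while a larger factor mainly improves anti-aliasing and slows down rendering.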

moi