The PDF format is basically a vector format that can also include bitmapped ("raster") images.
If the original PDF contains a scanned document, it will usually only contain a bitmapped image (often in TIFF or JPEG format), and then converting it to PNG is fine (if you stick to the original resolution of the image).
But if the original contains vector graphics (including text strings), converting those to a bitmap will generally introduce sampling errors. To avoid those, you can use 1-bit color depth ("black-and-white" format) and a resolution that at least matches the printer. This will produce quite a large PNG file, though. Using the TIFF format might yield a smaller file. The "TIFF-inside-PDF" format is something you often see when large drawings are scanned. According to ImageMagick's identify program, such a TIFF file looks something like this:
Format: TIFF (Tagged Image File Format)
Class: DirectClass
Geometry: 13231x9355+0+0
Resolution: 400x400
Print size: 33.0775x23.3875
Units: PixelsPerInch
Type: Bilevel
Base type: Bilevel
Endianess: MSB
Colorspace: Gray
Depth: 1-bit
Channel depth:
gray: 1-bit
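As a rough sanity check on these numbers, the sketch below recomputes the print size from the geometry and resolution and estimates how large the image would be without compression. The function names are made up for illustration, and the assumption that each row is padded to a whole number of bytes is a common convention rather than something the identify output states.

```python
import math


def print_size_inches(width_px, height_px, dpi):
    """Return the physical (width, height) in inches for a pixel geometry."""
    return width_px / dpi, height_px / dpi


def raw_bilevel_bytes(width_px, height_px):
    """Estimate the uncompressed size in bytes of a 1-bit image.

    Assumption: every row is padded to a whole number of bytes.
    """
    bytes_per_row = math.ceil(width_px / 8)
    return bytes_per_row * height_px


if __name__ == "__main__":
    w, h, dpi = 13231, 9355, 400  # values taken from the identify output
    print("print size (inches):", print_size_inches(w, h, dpi))
    print("uncompressed size (bytes):", raw_bilevel_bytes(w, h))
```

Under that padding assumption the uncompressed bitmap comes out at roughly 15 MB.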
Despite the huge size, the TIFF file is only 144 kB. The tiff2pdf program (part of the tiff package) can convert these to nice and small PDF files.
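If you want to call tiff2pdf from Python, something along these lines should work. This is only a sketch: the helper names and the file names in the usage comment are hypothetical, and it assumes tiff2pdf is installed and on your PATH. The -o option names the output file.

```python
import shutil
import subprocess


def tiff2pdf_command(tiff_path, pdf_path):
    """Build the argument list for tiff2pdf; -o names the output file."""
    return ["tiff2pdf", "-o", pdf_path, tiff_path]


def convert_tiff_to_pdf(tiff_path, pdf_path):
    """Run tiff2pdf, raising if the tool is missing or the conversion fails."""
    if shutil.which("tiff2pdf") is None:
        raise FileNotFoundError("tiff2pdf not found on PATH")
    subprocess.run(tiff2pdf_command(tiff_path, pdf_path), check=True)


# Usage (hypothetical file names):
# convert_tiff_to_pdf("drawing.tiff", "drawing.pdf")
```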
But the best way to preserve the document's format is to edit the PDF file itself, instead of converting it to another format.
There is a Python module for manipulating PDF documents: PyPDF2. But since you don't specify what you want to do with the document, it is impossible to say whether it can do what you want. There is also ReportLab, but that's more for generating PDF files. If you have the cairo library installed on your system, pycairo is a less heavyweight option for generating PDF documents.
An excellent utility in general for manipulating PDF files is pdftk (written in Java).
Edit: Sampling in grayscale will always introduce sampling artefacts. These are not errors in themselves, just a consequence of the sampling process.
Decompiling the PDF file into PostScript, as Ben Jackson mentions, can be done. There are a couple of utilities that can help you with that: pdftops from the poppler-utils package, and pdf2ps, which comes with ghostscript. In my experience, pdftops tends to produce more usable output.
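The conversion step by itself (not the editing that follows) is simple to script. The sketch below picks whichever of the two tools is installed, preferring pdftops; both take the input file followed by the output file. The function name and the `which` parameter are my own additions for illustration.

```python
import shutil


def decompile_command(pdf_path, ps_path, which=shutil.which):
    """Return a command list that converts a PDF file to PostScript.

    Prefers pdftops (poppler-utils) and falls back to pdf2ps (ghostscript).
    The `which` parameter only exists so the lookup can be replaced in tests.
    """
    for tool in ("pdftops", "pdf2ps"):
        if which(tool) is not None:
            return [tool, pdf_path, ps_path]
    raise FileNotFoundError("neither pdftops nor pdf2ps found on PATH")


# Usage (hypothetical file names):
# import subprocess
# subprocess.run(decompile_command("guide.pdf", "guide.ps"), check=True)
```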
But I haven't found a good way to automate this process. Below is a fragment from the NumPy User Guide decompiled with pdftops:
(At)
[7.192997 0 2.769603 0] Tj
-314 TJm
(the)
[2.769603 0 4.9813 0 4.423394 0] Tj
-313 TJm
(core)
[4.423394 0 4.9813 0 3.317546 0 4.423394 0] Tj
-314 TJm
(of)
[4.9813 0 3.317546 0] Tj
-313 TJm
(the)
[2.769603 0 4.9813 0 4.423394 0] Tj
-314 TJm
(NumPy)
[7.192997 0 4.9813 0 7.750903 0 5.539206 0 4.9813 0] Tj
-314 TJm
(package,)
[4.9813 0 4.423394 0 4.423394 0 4.9813 0 4.423394 0 4.9813 0 4.423394 0 2.49065 0] Tj
-329 TJm
This produces the sentence fragment "At the core of the NumPy package,". So if you search the PostScript file for anything between parentheses, you'll get the strings.
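Pulling out everything between parentheses can be done with a regular expression. The following is a minimal sketch, not a PostScript parser: it allows backslash escapes inside a string but does not handle unescaped nested parentheses, and it will also pick up parentheses that appear in comments or other non-text parts of the file. The names `PS_STRING`, `extract_strings` and `FRAGMENT` are made up for this example.

```python
import re

# Matches a PostScript string literal "( ... )", allowing backslash escapes.
# Simplification: unescaped nested parentheses are not handled.
PS_STRING = re.compile(r"\(((?:\\.|[^()\\])*)\)")


def extract_strings(ps_text):
    """Return the contents of all parenthesised strings, in file order."""
    return [m.group(1) for m in PS_STRING.finditer(ps_text)]


# The first three words of the fragment shown above.
FRAGMENT = """\
(At)
[7.192997 0 2.769603 0] Tj
-314 TJm
(the)
[2.769603 0 4.9813 0 4.423394 0] Tj
-313 TJm
(core)
[4.423394 0 4.9813 0 3.317546 0 4.423394 0] Tj
"""

if __name__ == "__main__":
    print(" ".join(extract_strings(FRAGMENT)))
```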
So changing individual words or removing short pieces is not that hard:
- Find the correct word(s) in the decompiled PostScript.
- Edit them (and the surrounding parameters!)
- Re-compile to pdf (with ghostscript).
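The first and last of these steps can be scripted; the editing in between is the part that needs a human. The sketch below assumes that each word sits on its own line as `(word)`, as in the pdftops fragment above, and uses ps2pdf (which comes with ghostscript) for the re-compile. The function names and file names are hypothetical.

```python
def find_word(ps_lines, word):
    """Return 1-based line numbers whose content is exactly "(word)".

    Assumption: every shown string is on a line of its own, as in the
    pdftops output above. Words containing parentheses or backslashes are
    escaped in PostScript and will not be found by this simple comparison.
    """
    token = "(" + word + ")"
    return [n for n, line in enumerate(ps_lines, start=1)
            if line.strip() == token]


def ps2pdf_command(ps_path, pdf_path):
    """Build the ps2pdf argument list: input file, then output file."""
    return ["ps2pdf", ps_path, pdf_path]


# Usage (hypothetical file names):
# import subprocess
# with open("guide.ps", encoding="latin-1") as f:
#     print(find_word(f.read().splitlines(), "NumPy"))
# ... edit guide.ps by hand ...
# subprocess.run(ps2pdf_command("guide.ps", "guide-edited.pdf"), check=True)
```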
But you would have to look at the beginning of the document to see what the functions Tj and TJm do. If you want to replace text, you'll have to remove the existing strings and put in new text and code with the correct parameters for Tj and TJm. This requires an understanding of PostScript. And if you are replacing a sentence, you usually cannot replace it with a longer one; there will not be enough space...
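To get a feel for how much room a word occupies, you could add up the numbers that follow it. This is a sketch under an assumption I have not verified against the definitions at the top of the file: that the array before Tj holds one (x, y) advance pair per character, which is what the fragment suggests (two pairs for "At", three for "the", and so on). The names `SHOW` and `word_advances` are made up.

```python
import re

# Assumption: pdftops writes "(string)" followed by "[dx dy dx dy ...] Tj".
# Strings containing parentheses are skipped by this simple pattern.
SHOW = re.compile(r"\(([^()]*)\)\s*\[([^\]]*)\]\s*Tj")


def word_advances(ps_text):
    """Return a list of (word, summed x advance) for each string shown with Tj."""
    result = []
    for m in SHOW.finditer(ps_text):
        numbers = [float(x) for x in m.group(2).split()]
        dx = numbers[0::2]  # assumed: even positions are the x advances
        result.append((m.group(1), sum(dx)))
    return result


if __name__ == "__main__":
    sample = "(of)\n[4.9813 0 3.317546 0] Tj\n-313 TJm\n"
    for word, width in word_advances(sample):
        print(word, width)
```

A replacement whose summed advances exceed those of the original text would run into whatever follows it, unless you also adjust the positioning of everything after it.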
Therefore it is generally advisable to try and get the original application to change the output.