
I want to convert a PDF file to PNG to manipulate it within Python, and then save it back as a PDF, but in the process a grey zone gets created around the fonts (my image is a simple black and white typed document). It's very faint and a bit hard to see on a screen, but when printed it becomes fairly visible.

Here's the specific command I use: PDF to PNG (in greyscale, super-sampling to preserve image quality):

convert -density 500 -alpha off file_in.pdf -scale 1700x2200 -bordercolor black -border 1x1 -fuzz 20% -trim +repage -colorspace Gray -depth 4 file_out.png

within Python

from PIL import Image
img = Image.open('file_out.png')
img.save('file_out2.pdf')

I also tried converting pdf to png with Ghostscript:

gs -sDEVICE=png16m -sOutputFile=file.png -dNOPAUSE -dBATCH -r300 file_out.pdf 

with the same result.

Here's part of what

identify -verbose file.png

gives for the ImageMagick png :

 Format: PNG (Portable Network Graphics)
  Class: PseudoClass
  Geometry: 1700x2200+0+0
  Resolution: 500x500
  Print size: 3.4x4.4
  Units: Undefined
  Type: Grayscale
  Base type: Grayscale
  Endianess: Undefined
  Colorspace: Gray
  Depth: 8/4-bit
  Channel depth:
    gray: 4-bit

Does anyone have a solution, or at least an explanation?

Edit: I found that using '-sample 1700x2200' instead of '-scale 1700x2200' fixed the grey around the fonts, but then the thin lines almost disappear and the font suffers from aliasing...

Tickon
  • "super-sampling to preserve image quality" basically means "adding a gray border to sharp black objects" to reflect sub-pixel positioning. Is this what you're seeing? – Ben Jackson Mar 30 '13 at 05:42
  • It's maybe 1/4 the size of the font, that seems way too big. And it's not visible in the PNG images. – Tickon Mar 30 '13 at 05:52

2 Answers


The pdf format is basically a vector format that can also include bitmapped ("raster") images.

If the original pdf contains a scanned document, it will usually only contain a bitmapped image (often in tiff or jpeg format) and then converting it to png is fine (if you stick to the original resolution of the image).

But if the original contains vector graphics (including text strings), converting those to a bitmap will generally introduce sampling errors. To avoid those, you can use 1-bit color depth ("black-and-white" format) and a resolution that at least matches the printer. This will produce quite a large png file, though. Using the tiff format might yield a smaller file. The "tiff-inside-pdf" format is something you often see when large drawings are scanned. According to ImageMagick's identify program, such a tiff file looks something like this:

  Format: TIFF (Tagged Image File Format)
  Class: DirectClass
  Geometry: 13231x9355+0+0
  Resolution: 400x400
  Print size: 33.0775x23.3875
  Units: PixelsPerInch
  Type: Bilevel
  Base type: Bilevel
  Endianess: MSB
  Colorspace: Gray
  Depth: 1-bit
  Channel depth:
    gray: 1-bit

Despite the huge size, the tiff file is only 144 kB. The tiff2pdf program (part of the tiff package) can convert these to nice and small pdf files.
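To make the 1-bit idea concrete, here is a minimal sketch of thresholding grey values to pure black or white. The helper names `to_bilevel` and `save_bilevel_tiff` and the cut-off of 128 are assumptions chosen for illustration, not something from the question. The file-based variant assumes a Pillow build with libtiff support, which Group 4 compression needs.

```python
def to_bilevel(rows, threshold=128):
    """Map 8-bit grey values to pure black (0) or pure white (255)."""
    return [[255 if value >= threshold else 0 for value in row] for row in rows]


def save_bilevel_tiff(path_in, path_out, threshold=128):
    """Apply the same threshold to an image file (hypothetical helper).

    Assumes Pillow is installed and was built with libtiff.
    """
    from PIL import Image  # imported here so to_bilevel() works without Pillow

    grey = Image.open(path_in).convert("L")
    # point() with mode="1" turns the 8-bit image into a 1-bit image.
    bilevel = grey.point(lambda value: 255 if value >= threshold else 0, mode="1")
    bilevel.save(path_out, compression="group4")


# One row of pixels across the edge of a glyph: black, grey fringe, white.
edge = [[0, 40, 130, 220, 255]]
print(to_bilevel(edge))  # [[0, 0, 255, 255, 255]]
```

The grey fringe is gone after thresholding, which is why a bilevel image prints cleanly. The price is that the resolution has to be high enough for the hard edges to look smooth.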

But the best way to preserve the document's format is to edit the pdf file itself, instead of converting it to another format.

There is a Python module for manipulating pdf documents: PyPDF2. But since you don't specify what you want to do with the document, it is impossible to say if this can do what you want. There is also ReportLab, but that's more for generating pdf files. If you have the cairo library installed on your system, pycairo is a less heavyweight option to generate pdf documents.

An excellent utility in general for manipulating pdf files is pdftk (written in java).

Edit: Sampling in grayscale will always introduce sampling artefacts. These are not errors in themselves, just a consequence of the sampling process.
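A toy illustration of where the grey comes from, in plain Python on a single row of pixels. It assumes that averaging blocks of pixels is a fair stand-in for what '-scale' does and that picking one pixel per block is a fair stand-in for '-sample'. The function names are made up for this sketch.

```python
def downscale_average(row, factor):
    """Box filter: each output pixel is the mean of `factor` input pixels."""
    return [sum(row[i:i + factor]) // factor for i in range(0, len(row), factor)]


def downscale_nearest(row, factor):
    """Point sampling: keep the first pixel of each block, drop the rest."""
    return [row[i] for i in range(0, len(row), factor)]


# High-resolution scanline: white paper (255) with a 5-pixel black stroke (0).
stroke = [255] * 6 + [0] * 5 + [255] * 5
print(downscale_average(stroke, 4))  # [255, 127, 63, 255]
print(downscale_nearest(stroke, 4))  # [255, 255, 0, 255]

# A 1-pixel hairline that falls between the sampled positions.
hairline = [255] * 5 + [0] + [255] * 10
print(downscale_average(hairline, 4))  # [255, 191, 255, 255]
print(downscale_nearest(hairline, 4))  # [255, 255, 255, 255]
```

Averaging keeps the stroke but turns its edges grey. Point sampling keeps everything pure black or white, but the stroke gets thinner and the hairline vanishes, which matches what was reported for '-sample'.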

Decompiling the pdf file into PostScript, as Ben Jackson mentions, can be done. There are a couple of utilities that can help you with that: pdftops from the poppler-utils package, and pdf2ps, which comes with ghostscript. In my experience, pdftops tends to produce more usable output.

But I haven't found a good way to automate this process. Below is a fragment from the Numpy User Guide decompiled with pdftops:

(At)
[7.192997
0
2.769603
0] Tj
-314 TJm
(the)
[2.769603
0
4.9813
0
4.423394
0] Tj
-313 TJm
(core)
[4.423394
0
4.9813
0
3.317546
0
4.423394
0] Tj
-314 TJm
(of)
[4.9813
0
3.317546
0] Tj
-313 TJm
(the)
[2.769603
0
4.9813
0
4.423394
0] Tj
-314 TJm
(NumPy)
[7.192997
0
4.9813
0
7.750903
0
5.539206
0
4.9813
0] Tj
-314 TJm
(package,)
[4.9813
0
4.423394
0
4.423394
0
4.9813
0
4.423394
0
4.9813
0
4.423394
0
2.49065
0] Tj
-329 TJm

This produces the text "At the core of the NumPy package,". So if you look in the PostScript file for anything between (), you'll get the strings.
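Pulling out the strings between parentheses can be sketched with a regular expression. This is only a rough sketch: it assumes simple string literals with no nested or escaped parentheses, which real PostScript output does not guarantee. The name `extract_words` is made up for the example.

```python
import re

# Matches "(" ... ")" with no parentheses or backslashes inside.
STRING_LITERAL = re.compile(r"\(([^()\\]*)\)")


def extract_words(postscript):
    """Return the contents of every simple string literal, in order."""
    return STRING_LITERAL.findall(postscript)


# Shortened copy of the decompiled fragment shown above.
fragment = """(At)
[7.192997
0
2.769603
0] Tj
-314 TJm
(the)
[2.769603
0
4.9813
0
4.423394
0] Tj
-313 TJm
(core)
[4.423394
0
4.9813
0
3.317546
0
4.423394
0] Tj
-314 TJm
"""

print(" ".join(extract_words(fragment)))  # At the core
```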

So changing individual words or removing short pieces is not that hard:

  • Find the correct word(s) in the decompiled PostScript.
  • Edit them (and the surrounding parameters!)
  • Re-compile to pdf (with ghostscript).
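The three steps above could be scripted roughly as follows. This is a sketch under several assumptions: pdftops and ps2pdf are on the PATH, the word appears as a plain "(word)" literal, and the replacement has the same number of characters so that the width array after the string still has one entry pair per glyph. The widths themselves are left untouched, so spacing may be slightly off. All function names and the "page.ps" file name are invented for the example.

```python
import subprocess


def decompile_command(pdf_path, ps_path):
    """Command line for step 1 (pdftops from poppler-utils)."""
    return ["pdftops", pdf_path, ps_path]


def recompile_command(ps_path, pdf_path):
    """Command line for step 3 (ps2pdf from ghostscript)."""
    return ["ps2pdf", ps_path, pdf_path]


def replace_word(postscript, old, new):
    """Step 2: swap one string literal for another of equal length."""
    if len(old) != len(new):
        raise ValueError("replacement must have the same number of characters")
    return postscript.replace(f"({old})", f"({new})")


def round_trip(pdf_in, pdf_out, old, new, ps_path="page.ps"):
    """Run all three steps; needs the external tools to be installed."""
    subprocess.run(decompile_command(pdf_in, ps_path), check=True)
    # latin-1 maps every byte to a character, so the file survives the edit.
    with open(ps_path, encoding="latin-1", newline="") as handle:
        text = handle.read()
    with open(ps_path, "w", encoding="latin-1", newline="") as handle:
        handle.write(replace_word(text, old, new))
    subprocess.run(recompile_command(ps_path, pdf_out), check=True)


print(replace_word("(the)\n[2.769603\n0] Tj", "the", "THE"))
```

Calling round_trip("in.pdf", "out.pdf", "the", "THE") would replace every "(the)" in the document, so a real script would need a more selective match.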

But you would have to look into the beginning of the document and see what the functions Tj and TJm do. If you want to replace text, you'll have to remove them and put in new text and code with the correct parameters for Tj and TJm. This requires an understanding of PostScript. And if you are replacing a sentence, you usually cannot replace it with a longer sentence; there will not be enough space...

Therefore it is generally advisable to try and get the original application to change the output.

Roland Smith
  • Thanks, I indeed seem to experience sampling error of fonts when I super-sample the image. Is there no way to get a good sampling in greyscale? What I want to do is open the file with PIL, to add some text and overlay an image. – Tickon Mar 30 '13 at 15:45

Is there no way to get a good sampling in greyscale? What I want to do is open the file with PIL, to add some text and overlay an image

A PDF is, loosely speaking, a compressed PostScript document (plus metadata). PostScript is a programming language. If you use pdf2ps you can then add code to the PostScript to draw over any existing parts of the PDF. Then convert back with ps2pdf.

Here's another question that deals with that idea directly: Is it possible in Ghostscript to add watermark to every page in PDF

Ben Jackson