I have a low-resolution black & white image, screenBWsmall.png.

I use the Python Imaging Library to convert it to PDF:

#!python
from PIL import Image 
im = Image.open('screenBWsmall.png')
im.save('screenBWsmall.pdf')

The PDF file is huge compared to one generated from ImageMagick's convert, issued from the Bash command line:

convert screenBWsmall.png screenBWsmall_IM.pdf

The file sizes are:

  11093 screenBWsmall.png
1050994 screenBWsmall.pdf
  16999 screenBWsmall_IM.pdf

While I'm puzzled by this, it is even more puzzling considering that the larger file screenBWsmall.pdf uses 1 bit per pixel (bits per component, or bpc) compared to 8 bpc for the smaller file screenBWsmall_IM.pdf:

$ pdfimages.exe -list screenBWsmall.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     960   540  gray    1   1  image  no         1  0    72    72 1025K 1621%

$ pdfimages.exe -list screenBWsmall_IM.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
-------------------------------------------------------------------------------------------- 
   1     0 image     960   540  gray    1   8  image  no         8  0    72    72 14.9K 2.9%

The Image.save documentation doesn't give much information with which I can speculate on the reason for the large file size.

Why does PIL create such a large PDF file size?

Is there any way to have it create a file as small as the one from ImageMagick's convert? I want to do it in Python because I will be performing more complex steps with many files.

My Python version is:

Python 3.8.8 (default, Mar  4 2021, 21:24:42) 
[GCC 10.2.0] on cygwin

Investigations with ImageMagick's convert

Thanks to fmw42's suggestions, I systematically experimented with 3 ways to shrink and combine 2 JPG images into 1 PDF file. In order of decreasing file size, the 3 methods are as follows.

Method #1: Use Python's PIL to generate IMG_077x_PIL.pdf (see jpg2pdf.py below). In the process of doing so, save shrunken versions of both images to separate PNG files for Method #3.

Method #2: Use ImageMagick's convert to generate IMG_077x_IMcvt.pdf:

convert -sample 50% -type Bilevel +dither IMG_077[45].JPG IMG_077x_IMcvt.pdf

Method #3: Apply convert to shrunken PNG files from PIL to generate IMG_077x_PIL+IMcvt.pdf:

convert IMG_077[45]small.png IMG_077x_PIL+IMcvt.pdf

Output PDF file sizes (in the same order as the methods above):

12350481 IMG_077x_PIL.pdf
 1234076 IMG_077x_IMcvt.pdf
  149782 IMG_077x_PIL+IMcvt.pdf

The 2 input JPG file sizes are a few MB each:

 2526685 IMG_0774.JPG
 2699515 IMG_0775.JPG

The 2 intermediate PNG file sizes used in Methods #1 and #3 are a few dozen KB each:

   67283 IMG_0775small.png
   61968 IMG_0774small.png

Observations:

  • Method #1: Great for shrinking the images down, but really bad in generating an enormous PDF file that is two orders of magnitude larger than it has to be.

  • Method #2: Middle of the road, most convenient, but PDF file size is an order of magnitude larger than it has to be.

  • Method #3: Requires Python, PIL, and convert. It is the least convenient, but the most byte-efficient. The resulting PDF is only slightly larger than the sum of the two PNG images.
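
As a quick check of the orders of magnitude quoted in these observations, the ratios can be computed directly from the byte counts listed earlier (the variable names are only for illustration):

```python
# Byte counts copied from the file listings in this question.
pdf_sizes = {
    "IMG_077x_PIL.pdf": 12350481,       # Method #1
    "IMG_077x_IMcvt.pdf": 1234076,      # Method #2
    "IMG_077x_PIL+IMcvt.pdf": 149782,   # Method #3
}
png_total = 67283 + 61968  # the two intermediate PNG files

smallest = pdf_sizes["IMG_077x_PIL+IMcvt.pdf"]
ratios = {name: size / smallest for name, size in pdf_sizes.items()}
overhead_vs_png = smallest / png_total

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the Method #3 PDF")
print(f"Method #3 PDF is {overhead_vs_png:.2f}x the combined PNG size")
```

That works out to roughly 82x for Method #1 and 8x for Method #2, with the Method #3 PDF about 16% larger than the two PNGs combined.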

I wish there was a way to make Methods #1 and/or #2 as good as Method #3.

Characteristics of the output PDF files

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
$pdfimages -list IMG_077x_PIL.pdf
   1     0 image    1512  2016  gray    1   1  image  no         1  0    72    72 6030K 1621%
   2     1 image    1512  2016  gray    1   1  image  no         4  0    72    72 6030K 1621%

$pdfimages -list IMG_077x_IMcvt.pdf
   1     0 image    2016  1512  gray    1   8  jpeg   no         8  0    72    72  576K  19% 
   2     1 image    2016  1512  gray    1   8  jpeg   no        22  0    72    72  626K  21% 

$pdfimages -list IMG_077x_PIL+IMcvt.pdf
   1     0 image    1512  2016  gray    1   8  image  no         8  0    72    72 68.9K 2.3% 
   2     1 image    1512  2016  gray    1   8  image  no        22  0    72    72 74.6K 2.5% 

jpg2pdf.py

#!python

# jpg2pdf.py
#-----------
# Use PIL to subsample, rotate, and convert 2 JPGs to B&W.
# Save each to small PNGs.
# Combine both into a PDF.

import os
from PIL import Image

ims = [] # Stores the 2 images
fns=('IMG_0774.JPG','IMG_0775.JPG') # Filenames of the 2 images

for fn in fns:
   
   # Read, resize, rotate, convert to B&W, add to list `ims`
   im = Image.open(fn)
   im = im.resize((im.width//2, im.height//2))
   im = im.rotate(-90,expand=True)
   im = im.convert(mode="1", dither=Image.NONE)
   ims.append(im)

   # Write IMG_077[45]small.png
   fnBase = os.path.splitext(fn)[0]
   im.save( fnBase+'small.png' )

# Write both to a single PDF
ims[0].save( 'IMG_077x_PIL.pdf' , save_all=True , append_images=ims[1:] )

A test input file

This dummy JPEG image file can be saved as both IMG_0774.JPG and IMG_0775.JPG. Methods #1 through #3 should then work exactly as described with the code posted above. Using this JPG image, I confirmed that the 3 output file sizes are almost the same as reported in my question. Unfortunately, at just over 2MB, it can't be uploaded to this posted question.

user36800
  • You could use Python Wand, which uses ImageMagick. – fmw42 Apr 17 '21 at 20:55
  • I'm actually trying to avoid ImageMagick because I'm doing other things too, like rotating, subsampling, turning smartphone photos to black & white. I haven't been able to figure out how to get ImageMagick to do that while benefiting from smaller file size ([here](https://superuser.com/questions/1641106/imagemagick-convert-smartphone-jpg-to-fax-quality-document)). I guess I could generate the smaller PNG with PIL, then mess around with Python Wand to get it to read the PNG and `convert` to PDF. Definitely an option on the table, but it would be so nice to do it with one package. – user36800 Apr 17 '21 at 21:05
  • Python Wand should be able to do all your processing that you can do with PIL and more. – fmw42 Apr 17 '21 at 21:23
  • I'm looking into it now. If it isn't part of Cygwin, I will learn the package setup outside of Cygwin, which I fear may cause problems whenever I upgrade Cygwin. But only one way to find out. The real issue is that I've done those things in ImageMagick from the Bash command line, starting from original smartphone photos, and the PDFs are enormous. It's only if I use PIL to create a subsampled B&W PNG and use `convert` *only* to convert from PNG to PDF that the size inflation doesn't occur much, as shown in my originally posted question. – user36800 Apr 17 '21 at 21:28
  • In fact, before venturing into Python and discovering PIL, I was resorting to using Octave or Matlab to subsample, create B&W, and export to PDF. I didn't complete those courses of investigation because I want to learn Python. – user36800 Apr 17 '21 at 21:33
  • In Imagemagick, use -sample not -resize on your B&W image to downsample and avoid interpolating new (gray) colors that will inflate your file size. Also set your -type bilevel on the B&W file if necessary. – fmw42 Apr 17 '21 at 21:49
  • @fmw42: Thanks for those parameters! I systematically compared `convert` with PIL and a combination of PIL+`convert`. The results are added to my original question. Using `convert` alone is the most convenient, but the output PDF file size is an order of magnitude larger than necessary. Better than using PIL, which inflates the bytes by 2 orders of magnitude. The best method, however, combines PIL with `convert`, but is not practical without more elaborate scripting to combine the two. If only `convert` can create an output PDF as small as `convert`+PIL, or if PIL can be made to do this. – user36800 Apr 18 '21 at 00:07
  • What was your Imagemagick command and can you post your input file? – fmw42 Apr 18 '21 at 15:50
  • @fmw42: The ImageMagick command is in the posted question under *Method #2* and *Method #3*. I uploaded a sample file [here](https://drive.google.com/file/d/1vFKJzN4fn0qKfNRQg6R4cRLgKnPJvy2l/view?usp=sharing), which can be saved to *IMG_0774.JPG* and to *IMG_0775.JPG*. Methods #1 through #3 should work exactly as described with the code posted in the question. Using this JPG image, I confirmed that the 3 output file sizes are almost the same as reported in my question. P.S. Running into [showstoppers](https://stackoverflow.com/questions/67145912/python-wand-missing-libraries-paths) with Wand – user36800 Apr 18 '21 at 16:29
  • I cannot access your file. I expect you will hear back shortly from Eric McConville about Wand – fmw42 Apr 18 '21 at 17:28
  • @fmw42: Woh! Thanks for pointing out your inaccessibility to the JPG on Google Drive. I rarely use it, and have just now learned that one has to explicitly make the document accessible by the public. Fixed now. It also just occurred to me that I can add the JPG file to the posted question, will now do so. – user36800 Apr 18 '21 at 17:32
  • This command produces a file at 2566 B. `convert IMG_0774.JPG -sample 50% -compress fax -type bilevel out1.pdf`. The issue is likely that it is done on a Q16 Imagemagick system and the output is 16/1-bit file. If I was using a Q8 Imagemagick, it could be half the size. PIL is likely outputting a file at 8-bits. So it is half the size. – fmw42 Apr 18 '21 at 17:46
  • @fmw42: I think you're on to something. I can't get 2.5KB like you, but get 75KB for a 2-page PDF -- half the size of the best method (#3 above). I do have to put all parameters before the input/output files for `+dither` to apply: `convert IMG_0774.JPG IMG_0775.JPG -sample 50% -compress fax -type bilevel +dither out1.pdf`. The encoding shows as `ccitt` using `pdfimages -list`, 1 bit per component (bpc), though `convert -version` shows Q16 for quantum depth. *Did you want to post this as the answer?* – user36800 Apr 18 '21 at 19:02
  • P.S. Your explanation that PIL uses 8 bits might not be the explanation in a way that I can understand. Only method #1 in the posted question uses PIL to generate the PDF, and it's the largest. – user36800 Apr 18 '21 at 19:06
  • +dither does nothing that I know in your command. It is only for drawing text in an image or in use with color quantization. It would always precede the command that uses that setting. – fmw42 Apr 18 '21 at 19:55
  • @fmw42: For me, it got rid of speckle in the output PDF. – user36800 Apr 18 '21 at 21:30
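
The `convert` command that worked in the comments above can also be driven from Python, which may help when many files are involved. This is only a sketch: it assumes ImageMagick's `convert` is on the PATH, and the function names `fax_pdf_command` and `make_fax_pdf` are made up for illustration. The flags are exactly the ones quoted in the comments.

```python
import shutil
import subprocess


def fax_pdf_command(inputs, output, scale="50%"):
    """Build the argument list for the convert command quoted in the comments."""
    return [
        "convert", *inputs,
        "-sample", scale,
        "-compress", "fax",
        "-type", "bilevel",
        "+dither",
        output,
    ]


def make_fax_pdf(inputs, output, scale="50%"):
    """Run convert; raises CalledProcessError if ImageMagick reports failure."""
    if shutil.which("convert") is None:
        raise RuntimeError("ImageMagick's convert was not found on PATH")
    subprocess.run(fax_pdf_command(inputs, output, scale), check=True)


if __name__ == "__main__":
    make_fax_pdf(["IMG_0774.JPG", "IMG_0775.JPG"], "out1.pdf")
```

Passing the arguments as a list avoids shell quoting issues with filenames that contain spaces.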

2 Answers

Why does PIL create such a large PDF file size?

Unfortunately, Pillow re-encodes black & white images with DCTDecode (the filter used for JPEG):
https://github.com/python-pillow/Pillow/blob/0f44136e720cd3b2db72bdf29614897b7aa3e868/src/PIL/PdfImagePlugin.py#L127

$ pdfimages -list screenBWsmall.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     960   540  gray    1   8  jpeg   no         1  0    72    72 93.1K  18%

You can use img2pdf, which embeds the image without re-encoding it:

from pathlib import Path

import img2pdf

Path("screenBWsmall.pdf").write_bytes(img2pdf.convert("screenBWsmall.png"))

$ pdfimages -list screenBWsmall.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     960   540  gray    1   1  image  no         7  0    96    96 10.8K  17%

I tested with:

  • Python: 3.10.4
  • Pillow: 9.1.1
  • img2pdf: 0.4.4
  • pdfimages: 20.09.0
Yukihiko Shinoda
  • Thanks, Yukihiko. I upvoted your answer but haven't marked it as an answer because I haven't yet tried it. The last time I tried to install a Python package outside of the host system (Cygwin), I was not able to make it work. So I need a window of time to give your solution a try. On a separate but possibly related note, Cygwin has a C++ library with an `img2pdf` utility under the PoDoFo package. For me, it may be a more direct route to access that function, but it's unclear whether it is the same `img2pdf`. – user2153235 Jun 13 '22 at 18:33

I think it's due to the mode of your image and PIL feeling constrained to retaining it as a bi-level image.

If you do this:

im = Image.open('lorem.png')

# Check type of image - it is bi-level, i.e. mode=1
print(im)
<PIL.PngImagePlugin.PngImageFile image mode=1 size=960x540 at 0x7F9E08A65100>


# Save and check size
im.save('lorem.pdf')

# -rw-r--r--     1 mark  staff  1050978 18 Apr 10:01 lorem.pdf   <--- YIKES

If you tell PIL it's ok to treat it as colour, it works fine:

im = Image.open('lorem.png').convert('RGB')
im.save('lorem.pdf')

# -rw-r--r--     1 mark  staff    99556 18 Apr 10:02 lorem.pdf  <--- THAT'S BETTER

There is a clue in the documentation here where it says the way it is written depends on the mode and the availability of the JPEG encoder.
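
A rough size check points the same way. The sketch below assumes (this is an inference from the numbers, not something the documentation states) that the non-JPEG path stores one byte per pixel as two ASCII hex characters:

```python
# Back-of-the-envelope estimate for the 960x540 bi-level image.
width, height = 960, 540
pixels = width * height

packed_1bpc = pixels // 8   # 1 bit per pixel, uncompressed
hex_estimate = pixels * 2   # ASSUMPTION: 1 byte per pixel, 2 hex chars per byte

reported_pdf = 1050994      # size of screenBWsmall.pdf from the question

print(packed_1bpc)                            # 64800
print(hex_estimate)                           # 1036800
print(round(reported_pdf / packed_1bpc, 1))   # 16.2
```

The estimate lands within about 14 KB of the reported file size, and the factor of roughly 16 agrees with the 1621% ratio that `pdfimages -list` shows in the question. The remainder is presumably line breaks in the hex stream plus PDF structure.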

Mark Setchell
  • Thanks for that. I did happen on the `save` page, but didn't derive enough information there to mention it. Re-looking, the image format is either JPG or HEX encoding. My doctoral thesis was in the area of digital system design, and there wasn't enough detail about HEX encoding to get an idea of what happens. I also haven't found an image standard based on "HEX" alone. I vaguely recall that PDF can use an internal image format that isn't based on an external standard. – user36800 Apr 18 '21 at 13:41
  • To test your suggestion to remove B&W, I commented out the `im.convert()` command in `jpg2pdf.py` above. The resulting PDF was 33% smaller than *Method #2* above (the most convenient one), but 5x bigger than *Method #3* (most byte-efficient). Thanks, it helps a bit. – user36800 Apr 18 '21 at 13:41
  • You might try `convert('L')` on your system too. You can also check if your PIL has JPEG or turbo-JPEG feature with `python -m PIL` – Mark Setchell Apr 18 '21 at 13:51
  • Thanks, that decreased the PDF file size by 4.6%. For all cases tried without `mode="1"`, `pdfimages -list` shows `jpeg` encoding. Honestly, I'm a bit wary of JPEG because, at least based on ImageMagick, any changes cause a re-encoding, which reduces fidelity. With `mode="1"` (Method #1 above), `pdfimages -list` shows 1 bit per component, so whatever "HEX" format is being used to generate the PDF must not be exploiting that. The most byte-efficient PDF above (Method #3) does not show `jpeg` encoding, but it relies on ImageMagick's `convert` to generate the PDF rather than PIL's `save`. – user36800 Apr 18 '21 at 14:03
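
The `python -m PIL` check mentioned in the comments can also be done from code. A minimal sketch using Pillow's `PIL.features` module (the helper name `jpeg_support` is made up for illustration):

```python
from PIL import features


def jpeg_support():
    """Report whether this Pillow build has a JPEG codec and libjpeg-turbo."""
    return {
        "jpg": features.check("jpg"),
        "libjpeg_turbo": features.check("libjpeg_turbo"),
    }


if __name__ == "__main__":
    print(jpeg_support())
```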