
My workplace scanner creates exorbitantly large PDFs from low-resolution grayscale scans of hand-written notes. I currently use Acrobat Pro to extract PNG images from the PDF, then use Matlab to reduce the bit depth, then use Acrobat Pro to combine them back into PDFs. I can reduce the PDF file size by one to two orders of magnitude.

But is it ever a pain.

I'm trying to write scripts to do this, composed of Cygwin command-line tools. Here is one PDF that was shrunk using my byzantine scheme:

$ pdfimages -list bothPNGs.pdf

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     550   558  gray    1   2  image  no        25  0    72    72 6455B 8.4%
   2     1 image     523   519  gray    1   2  image  no         3  0    72    72 5968B 8.8%

I had used Matlab to reduce the bit depth to 2. To test the use of Unix tools, I re-extracted the PNGs using pdfimages, then used convert to recombine them into a PDF, specifying a bit depth in doing so:

$ convert -depth 2 sparseDataCube.png asnFEsInTstep.png bothPNGs_convert.pdf
# Results are the same regardless of the presence/absence of `-depth 2`
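
(For completeness, the extraction step is not shown above; it is just pdfimages with its PNG output option, assuming a poppler build new enough to have -png, and an arbitrary image-root prefix such as "page":)

$ pdfimages -png bothPNGs.pdf page
# writes page-000.png, page-001.png, ... ("page" is just a prefix I chose)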

$ pdfimages -list bothPNGs_convert.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     550   558  gray    1   8  image  no         8  0    72    72 6633B 2.2%
   2     1 image     523   519  gray    1   8  image  no        22  0    72    72 6433B 2.4%

Unfortunately, the bit depth is now 8. My bit depth argument doesn't actually seem to have any effect.

What would be the recommended way to reduce the bit depth of the PNGs and recombine them into a PDF? Whatever tool is used, I want to avoid antialiasing filtering. In non-photographic images, that just causes speckle around the edges of text and lines.

Whatever solution is suggested, it will be hit-or-miss whether I have the right Cygwin packages. I work in a very controlled environment, where upgrading is not easy.

There is another similar-sounding question, but I really don't care about any alpha layer.

Here are two image files, with bit depths of 2, that I generated for testing:

[test1.png and test2.png: the two 100x100 2-bit grayscale test images used below]

Here are the tests, based on my initial (limited) knowledge, as well as on respondent Mark's suggestions:

$ convert -depth 2 test1.png test2.png test_convert.pdf
$ pdfimages -list test_convert.pdf

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   8  image  no         8  0    72    72 3204B  32%
   2     1 image     100   100  gray    1   8  image  no        22  0    72    72 3221B  32%

$ convert -depth 2 test1.png test2.png -define png:color-type=0 -define png:bit-depth=2 test_convert.pdf
$ pdfimages -list test_convert.pdf

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   8  image  no         8  0    72    72 3204B  32%
   2     1 image     100   100  gray    1   8  image  no        22  0    72    72 3221B  32%    

The bit depths of images within the created PDF file are 8 (rather than 2, as desired and specified).

user36800
  • What is "Adobe Pro"? I've been an Adobe developer in the past and have never heard of such a product. Do you have a link? – user1118321 May 26 '18 at 03:37
  • It's actually [Adobe Acrobat Pro](http://acrobat.adobe.com/us/en/acrobat/acrobat-pro.html). I will revise the original post. – user36800 May 26 '18 at 05:43
  • Couldn't help but notice the down-vote. A little explanation would be helpful (whoever did it). Thanks. – user36800 May 26 '18 at 05:44
  • Not sure I understand why you care... a) storage is cheap and getting cheaper and b) the files in your example are all around 6,000 bytes so why care whether they are 2bpc or 8bpc? – Mark Setchell May 26 '18 at 10:02
  • These are minuscule documents. Individually, I don't care. Cumulatively, they make a big difference, especially if scans are made routinely on documents large and small (paperless offices are now the aim). Furthermore, the impact on email client files is felt more severely by routinely attaching documents to email. For the minuscule example that I used to find a solution, the PDFs that are initially created are hundreds of KBs, while my recreated PDFs are several KBs. – user36800 May 26 '18 at 12:38
  • IT support have already been contacted -- they have no solution. The vendor community pages have already been posted to -- they have tried to be helpful, but their recommendations haven't yielded any difference. I am stuck doing this on a routine basis. Frankly, I'm fortunate that I even have this solution, as time consuming as it is. I don't have Matlab or Cygwin on the network where this is most needed, so in addition to the buttonology, emailing of bitmap files is needed. – user36800 May 26 '18 at 12:38
  • If you use LZW or ZIP (deflate) compression, the bit depth is irrelevant. What matters is the number of unique colors and the repetitiveness of data. I would bet that the file size wouldn’t change by more than a few bytes if you managed to store the PNG as 2-bit data rather than 8-bit data (assuming identical data). The 8-bit version, with all those redundant zero bits, is just easier to read into memory and manipulate. – Cris Luengo May 26 '18 at 14:33
  • I may not have access to the computer with Matlab today, but did some simple tests with Octave. I created 8-bit versions of the test PNGs, `convert`ed them to a PDF with `Zip` compression, and they showed 32% compression. Octave shows that the gray values are 4 equally spaced values between 0 and 255, as expected, since the *source* PNGs were 2-bit. – user36800 May 26 '18 at 21:05
  • To sanity-check `convert` compression and `pdfimages` display of compression, I checked whether the 2-bit PNGs yield *no* compression when converted to PDF; `pdfimages` shows that they *also* compressed by 32%. This is nonsensical, since the values are random. I used command-line `zip` to confirm noncompressibility of the 2-bit PNGs. I thought maybe `convert` internally works with 8 bits, and the 32% is relative to 8 bits. If so, then command-line `zip` should also show 32% for the 8-bit PNGs, but it showed 0%. The 2-bit PNGs are 2687 bytes while the 8-bit PNGs are 3297 bytes and 3311 bytes. – user36800 May 26 '18 at 21:05

2 Answers


Thanks to Mark Setchell's and Cris Luengo's comments and answers, I've come up with some tests that may reveal what is going on. Here are the 2-bit and 8-bit random grayscale test PNGs created using Matlab:

im = uint8( floor( 256*rand(100,100) ) );
imwrite(im,'rnd_b8.png','BitDepth',8);
imwrite(im,'rnd_b2.png','BitDepth',2);
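
To double-check that the two files really were written at the requested depths, imfinfo reports the bit depth recorded in the PNG header (a quick check in Matlab; I expect Octave behaves the same):

info2 = imfinfo('rnd_b2.png');
info8 = imfinfo('rnd_b8.png');
[info2.BitDepth info8.BitDepth]   % expect [2 8]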

The 2-bit PNGs have much less entropy than the 8-bit PNGs.
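
Quantitatively: 100 x 100 pixels at 2 bits each is 2,500 bytes of raw payload versus 10,000 bytes at 8 bits, and since the pixels are independent and uniformly distributed, the raw size is essentially the entropy, so the 2-bit image carries roughly a quarter of the information of the 8-bit one.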

The following shell commands create PDFs with and without compression:

convert rnd_b2.png rnd_b2.pdf
convert rnd_b2.png -depth 2 rnd_b2_d2.pdf
convert rnd_b2.png -compress LZW rnd_b2_lzw.pdf
convert rnd_b8.png rnd_b8.pdf
convert rnd_b8.png -depth 2 rnd_b8_d2.pdf
convert rnd_b8.png -compress LZW rnd_b8_lzw.pdf

Now check file sizes, bit depth, and compression (I use bash):

$ ls -l *.pdf
 8096 rnd_b2.pdf
 8099 rnd_b2_d2.pdf
 7908 rnd_b2_lzw.pdf
22523 rnd_b8.pdf
 8733 rnd_b8_d2.pdf
29697 rnd_b8_lzw.pdf

$ pdfimages -list rnd_b2.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   8  image  no         8  0    72    72 3178B  32%

$ pdfimages -list rnd_b2_d2.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   8  image  no         8  0    72    72 3178B  32%

$ pdfimages -list rnd_b2_lzw.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   8  image  no         8  0    72    72 3084B  31%

$ pdfimages -list rnd_b8.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   8  image  no         8  0    72    72 9.78K 100%

$ pdfimages -list rnd_b8_d2.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   8  image  no         8  0    72    72 3116B  31%

$ pdfimages -list rnd_b8_lzw.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     100   100  gray    1   8  image  no         8  0    72    72 13.3K 136%

Essentially, convert does not create images of the user-specified bit depth to put into the PDFs; it converts the 2-bit PNGs to 8-bit. This means that PDFs created from 2-bit PNGs contain much less entropy than the maximum for 8-bit images. I confirmed this by extracting the PNGs and checking that there are only 4 grayscale levels in the data.
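
(The check itself is a one-liner in Matlab or Octave; the file name below is just a placeholder for whatever pdfimages extracted:)

im = imread('extracted-000.png');   % placeholder name for an image pulled back out of the PDF
disp(unique(im(:))')                % expect only 4 distinct, equally spaced gray levels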

The fact that rnd_b8_d2.pdf is comparable in size to the PDFs created from 2-bit PNGs reveals how convert handles -depth 2 that precedes the output file specification. It seems that it does reduce dynamic range to 2 bits at some point, but expands it out to 8 bits for incorporation into the PDF.

Next, compare file sizes with their compression ratios, taking the uncompressed 8-bit random grayscale, i.e., rnd_b8.pdf, as the baseline:

rnd_b2.pdf       8096 / 22523 =  36%
rnd_b2_d2.pdf    8099 / 22523 =  36%
rnd_b2_lzw.pdf   7908 / 22523 =  35%
rnd_b8.pdf      22523 / 22523 = 100%
rnd_b8_d2.pdf    8733 / 22523 =  39%
rnd_b8_lzw.pdf  29697 / 22523 = 131%

It seems that the ratio reported by pdfimages is the amount of space taken by the image stream compared to an uncompressed, maximum-entropy 8-bit image of the same dimensions.
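
A quick arithmetic check of that reading against the listings above: an uncompressed 8-bit 100 x 100 grayscale image is 100 x 100 x 1 B = 10,000 B, and indeed 3178 B / 10,000 B is about 32% (rnd_b2.pdf), 9.78K is about 100% (rnd_b8.pdf), and 13.3K is about 136% (rnd_b8_lzw.pdf).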

It also seems that convert applies compression regardless of whether it is specified in the switches. This follows from the fact that the rnd_b2*.pdf files are all of similar size and ratio.

I assume that the 31% increase for rnd_b8_lzw.pdf is overhead from attempting compression when no compression is possible. Does this seem reasonable to "you" image processing folks? (I am not one.)

Based on the assumption that compression happens automatically, I don't need Matlab to reduce the dynamic range. The -depth 2 specification to convert will decrease the dynamic range, and even though the image is in the PDF as 8-bits, it is automatically compressed, which is almost as efficient as 2-bit images.

There is only one big concern. According to the above logic, the following files should all look comparable:

rnd_b2.pdf
rnd_b2_d2.pdf
rnd_b2_lzw.pdf

rnd_b8_d2.pdf

The first three do, but the last does not. It is the one that relies on the -depth 2 specification to convert to reduce the dynamic range. Matlab shows that only 4 grayscale levels between 0 and 255 are used, but the two middle levels occur twice as often as the edge levels. Using -depth 4, I found that the minimum and maximum grayscale levels always occur only half as often as the other, uniformly distributed grayscale levels. The reason became apparent when I plotted the mapping of gray levels in rnd_b8.pdf against its 4-bit-depth counterpart:

[plot: 8-bit gray levels of rnd_b8.pdf mapped to their 4-bit counterparts]

The "bins" of 8-bit gray level values that map to the minimum and maximum 4-bit gray levels is half as wide as for the other 4-bit gray levels. It might be because the bins are symmetrically defined such that (for example), the values that map to zero include negative and positive values. This wastes half the bin, because it lies outside the range of the input data.

The take-away is that one can use the -depth specification to convert, but for small bit depths, it is not ideal because it doesn't maximize the information in the bits.

AFTERNOTE: An interesting beneficial effect that I observed, obvious in hindsight, especially in light of Cris Luengo's comment: if the images in the PDF do indeed have limited bit depth, e.g., 4 bits, then you can extract them with pdfimages and re-package them into a PDF without worrying too much about specifying the right -depth. When re-packaging into PDF, I noticed that -depth 5 and -depth 6 did not increase the PDF file size much over -depth 4, because the default compression squeezes out any space wasted in the 8-bit image within the PDF. Subjectively, the quality remains the same too. If I specify -depth 3 or below, however, the PDF file size decreases more noticeably, and the quality declines noticeably too.

Further helpful observations: After the better part of a year, I needed to package scanned files into a PDF again, but this time I used a scanner that created a PNG file for each page. I had no desire to re-spend the time taken above reverse-engineering the behaviour of the ImageMagick tools. Not being bogged down in the weeds, I was able to notice three helpful code idioms (at least helpful to me), and I hope they help someone else. For context, assume that you want to downgrade the grayscale depth to 2 bits, which allows for 4 levels; I found this to be plenty for scanned text documents, with negligible loss in readability.

First, if you scanned in (say) 200 dpi grayscale and you want to downgrade to 2 bits, you need to specify the -density prior to the first (input) file: convert -density 200x200 -depth 2 input.png output.pdf. Not doing so yields extremely coarse resolution, even though pdfimages -list shows 200x200.

Second, you want to use one convert statement to turn the whole collection of PNG files into a single depth-limited PDF file. I found this out because I initially converted multiple PNG files into one PDF file and then converted that to a depth of 2; the file size shrinks, but not nearly as much as it could. In fact, when I had only one input file, the size actually increased by a third. So the ideal pattern for me was convert -density 200x200 -depth 2 input1.png input2.png output.pdf.

Third, documents manually scanned one page at a time often need page rotation adjustments, and web searching yields the recommendation to use pdftk rather than (say) convert (well discussed here). The rationale is that convert rasterizes. Even though scans are already rasterized, I elected to use pdftk to avoid the possibility of re-rasterizing and the associated possibility of degraded fidelity. pdfjam might also do nicely, but starting code patterns for page-specific rotations were already given for pdftk. From experimentation, the pattern for me was (say) pdftk input.pdf cat 1west 2east 3east output output.pdf.
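
Putting the three observations together, the pattern that worked for me looks like the following (the file names and rotation arguments are of course placeholders for the actual pages):

# page1.png .. page3.png and the rotation spec stand in for the real scanned pages
convert -density 200x200 -depth 2 page1.png page2.png page3.png combined.pdf
pdftk combined.pdf cat 1west 2east 3east output combined_rotated.pdf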

user36800
  • You are certainly correct that the two end bins are narrower - you can see that from the very first command in my answer. I am still not sure that testing compression of a random image is a good test. Initially the problem you wanted to address was that your scanned files were unnecessarily large - so, surely the true test is how the parameters affect your scanned images... – Mark Setchell May 28 '18 at 07:36
  • As for using random grayscale PNGs, I'm not suggesting that the compression is representative of scanned documents. I just wanted to figure out what `convert` was doing and how to interpret the `pdfimages` table of information. Based on this, I now have a sense of how much confidence to have in using them for other documents. I will keep an eye on the sizes and subjective image quality to see if the evidence departs significantly from my conjectured mental model of its operation. – user36800 May 28 '18 at 12:08
  • As for seeing the weird subsampling of grayscale levels in the gradient image, I did notice that your gradient had smaller white and black extremes. It didn't occur to me to look back and make sense of it after figuring out convert's weird subsampling of grayscale levels. In light of your comments about the nonrepresentativeness of random grayscale test images, I'm beginning to better appreciate the use of the gradient image. I'm not sure how much entropy there is in a gradient, but it might be closer to the typical entropy of scanned documents. – user36800 May 28 '18 at 12:22
  • Note that my understanding of entropy is limited at best, not working in image processing, compression, or encryption. It's based on 3rd-year undergrad EE. I know it's probability based, and hence affected by the distribution of pixels across grayscale levels. I don't know if it is reduced by the obvious deterministic spatial pattern in the gradient, the correlation between nearby pixels, and the skewing of spectral energy toward low spatial frequencies. I've never gotten my head around 2D Fourier Transforms. – user36800 May 28 '18 at 12:40

Updated Answer

I am still looking at this. One thing I have noticed is that it does appear to honour compression when writing PDFs...

# Without compression
convert -depth 2 -size 1024x768 gradient: a.pdf
pdfimages -list a.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1     0 image    1024   768  gray    1   8  image  no         8  0    72    72 12.1K 1.6%

# With compression
convert -depth 2 -size 1024x768 gradient: -compress lzw a.pdf
pdfimages -list a.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1024   768  gray    1   8  image  no         8  0    72    72 3360B 0.4%

You can list the available types of compression with:

identify -list compress

It seems to accept the following for PDF output:

  • JPEG
  • LZW
  • ZIP
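
For example, ZIP compression from that list can be requested the same way (a sketch; as far as I know, ImageMagick's Zip compression ends up as a Flate/Deflate stream in the PDF):

# same gradient test as above, but with ZIP (Flate) compression requested
convert -depth 2 -size 1024x768 gradient: -compress zip a.pdf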

Note that your test images do not achieve very good compression, but then again, consider how representative they really are of your documents - they look very random and such things always compress poorly.

Initial Answer

Please try adding:

-define png:bit-depth=2

and/or

-define png:color-type=X

where X is either 0 (grayscale) or 3 (indexed, i.e. palettised)

So, specifically:

convert image1.png image2.png -define <AS ABOVE> output.pdf
Mark Setchell
  • Thanks for the attempt, Mark. The results of `pdfimages -list` don't change as a result of specifying your command-line switches to the `convert` command for combining the PNGs into a PDF. I tried positioning them both as shown above as well as in the position of leading arguments. (I also tried this varying of positions for the `depth` argument in my original post.) For each of these two variations, I further alternated your `-define png:color-type=X` between X=0 and X=3. For both positionings of the arguments, I also tried not defining the `color-type` argument. – user36800 May 26 '18 at 13:07
  • Where can I find documentation on the `gradient:` specification to `compress`? It doesn't show up in the *man* page. Thanks... – user36800 May 26 '18 at 21:07
  • `gradient:` just generates a gradient, from black to white in this instance, to make some test data. It is not related to `-compress`. You can try `convert -size 1024x768 gradient: result.png` and the same again with `-depth 2` before the output filename. – Mark Setchell May 26 '18 at 21:14
  • Hmm, OK. I was hoping to find documentation on all these keywords. I usually rely on the fact that everything is in the man pages, no matter how impenetrable. I noted that the `-compress` specification is treated as case-insensitive (I used the case from `identify -list compress`). – user36800 May 28 '18 at 01:37
  • I'm puzzled by the `ratio` of 1.6% without `compress`, and 0.4% with `LZW`. Shouldn't it be *higher*? As well, I've been sanity checking `compress` and the `pdfimages` `ratio`. I captured this as comments under my original post. Frankly, I'm not sure what to make of the results, and hence, of the `compress` of `convert`, and the `ratio` of `pdfimages`. – user36800 May 28 '18 at 01:37