
I have been setting up a PDF-to-PNG conversion and cropping script with Python 3.6.3 and the wand library.

I tried Pillow, but it lacks the PDF conversion part. I am experimenting with removing the alpha channel because I want to feed the images to an OCR engine at a later point, so I turned to the code provided in this SO answer.

A couple of issues came up. The first is that if the file is large, I get a "Killed" message in the terminal. The second is that wand seems rather picky about its input: files that convert properly with ImageMagick's convert or with pdftoppm on the command line raise errors in wand.

I am mostly concerned with the first one though, and would really appreciate a check from more knowledgeable coders. I suspect it might come from the way the loop is structured:

from wand.image import Image
from wand.color import Color


def convert_pdf(filename, path, resolution=300):
    all_pages = Image(filename=path+filename, resolution=resolution)
    for i, page in enumerate(all_pages.sequence):
        with Image(page) as img:
            img.format = 'png'
            img.background_color = Color('white')
            img.alpha_channel = 'remove'

            image_filename = '{}.png'.format(i)
            img.save(filename=path+image_filename)

I noticed that the script writes all the files at the end of the process, rather than one by one, which I am guessing puts an unnecessary burden on memory and might ultimately cause a segfault or something similar.

Thanks for checking out my question, and for any hints.

Giampaolo Ferradini

1 Answer


Yes, your line:

all_pages = Image(filename=path+filename, resolution=resolution)

will start a Ghostscript process to render the entire PDF to a huge temporary PNM file in /tmp. Wand then loads that massive file into memory and hands out pages from it as you loop.

The C API to MagickCore lets you specify which page to load, so you could perhaps render a page at a time, but I don't know how to get the Python wand interface to do that.
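
One untested possibility (an assumption on my part, not a documented wand feature): ImageMagick's filename syntax accepts a zero-based page index in square brackets, e.g. file.pdf[0], and wand appears to pass the filename string through to ImageMagick. If that holds, a sketch like this would render one page per Ghostscript run, and the with block frees each image as soon as it is written:

from wand.image import Image
from wand.color import Color


def convert_pdf_page(path, filename, page_number, resolution=300):
    # "file.pdf[0]" is ImageMagick's single-page selection syntax;
    # whether wand forwards it unchanged is an assumption, not a
    # documented wand feature
    page_spec = '{}[{}]'.format(path + filename, page_number)
    with Image(filename=page_spec, resolution=resolution) as img:
        img.format = 'png'
        img.background_color = Color('white')
        img.alpha_channel = 'remove'
        img.save(filename='{}{}.png'.format(path, page_number))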

You could try pyvips. It renders PDFs incrementally by making direct calls to libpoppler, so there are no processes being started and stopped and no temporary files.

Example:

#!/usr/bin/python3

import sys
import pyvips

def convert_pdf(filename, resolution=300):
    # n is the number of pages to load; -1 means load all pages
    all_pages = pyvips.Image.new_from_file(filename, dpi=resolution, n=-1,
                                           access="sequential")

    # that'll be RGBA ... flatten out the alpha
    all_pages = all_pages.flatten(background=255)

    # the PDF is loaded as a very tall, thin image, with the pages joined
    # top-to-bottom ... we loop down the image, cutting out each page
    n_pages = all_pages.get("n-pages")
    page_width = all_pages.width
    # integer division, since crop expects whole-pixel coordinates
    page_height = all_pages.height // n_pages

    for i in range(n_pages):
        page = all_pages.crop(0, i * page_height, page_width, page_height)
        print("writing {}.tif ..".format(i))
        page.write_to_file("{}.tif".format(i))

convert_pdf(sys.argv[1])

On this 2015 laptop with this huge PDF, I see:

$ /usr/bin/time -f %M:%e ../pages.py ~/pics/Audi_US\ R8_2017-2.pdf 
writing 0.tif ..
writing 1.tif ..
....
writing 20.tif ..
720788:35.95

So about 36 seconds to render the entire document at 300 dpi, with a peak memory use of about 720 MB (the %M field reports peak resident set size in kilobytes, %e the elapsed time in seconds).
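
pyvips' pdfload also takes a page parameter, so as an alternative to cropping the tall image you could load one page per iteration; this trades re-opening the document for an even smaller peak memory footprint. A minimal sketch (it assumes pdfload attaches the n-pages metadata even when loading a single page, as the libvips docs describe):

#!/usr/bin/python3

import sys
import pyvips

def convert_pdf_per_page(filename, resolution=300):
    # load just the first page to read the page count from metadata
    first = pyvips.Image.new_from_file(filename, dpi=resolution, page=0)
    n_pages = first.get("n-pages")

    for i in range(n_pages):
        # page=i asks the loader to render only that page
        page = pyvips.Image.new_from_file(filename, dpi=resolution, page=i)
        page = page.flatten(background=255)
        print("writing {}.tif ..".format(i))
        page.write_to_file("{}.tif".format(i))

convert_pdf_per_page(sys.argv[1])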

jcupitt
  • Excellent, thanks. Will play with it a bit, and return with feedback and/or accept. – Giampaolo Ferradini Jan 30 '19 at 20:40
  • Sigh... I tried your solution @jcupitt but I hit a wall. Apparently pyvips [does not run well in Anaconda](https://github.com/libvips/pyvips/issues/60), and I am not expert enough to try to build it inside a virtual environment. Thanks a lot for your help, I would have loved the efficiency of pyvips. – Giampaolo Ferradini Feb 02 '19 at 17:54
  • Ah OK, yes, there's no Anaconda package for libvips. It has its own system for native binaries and can't use standard packages. – jcupitt Feb 02 '19 at 18:36
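
As an aside for readers hitting the same Anaconda wall: a plain virtual environment plus pip can work, as long as the native libvips library is installed system-wide first. A sketch for Debian/Ubuntu (the libvips-dev package name is an assumption for other distros):

$ sudo apt-get install libvips-dev
$ python3 -m venv pdfenv
$ . pdfenv/bin/activate
$ pip install pyvips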