I have been working on setting up a PDF conversion-to-png and cropping script with Python 3.6.3 and the wand library.
I tried Pillow, but it's lacking the conversion part. I am experimenting with extracting the alpha channel because I want to feed the images to an OCR, at a later point, so I turned to trying the code provided in this SO answer.
A couple of issues came out: the first is that if the file is large, I get a "Killed" message from the terminal. The second is that it seems rather picky with the file, i.e. files that get converted properly by imagemagick's convert or pdftoppm in the command line, raise errors with wand.
I am mostly concerned with the first one though, and would really appreciate a check from more knowledgeable coders. I suspect it might come from the way the loop is structured:
from wand.image import Image
from wand.color import Color
def convert_pdf(filename, path, resolution=300):
all_pages = Image(filename=path+filename, resolution=resolution)
for i, page in enumerate(all_pages.sequence):
with Image(page) as img:
img.format = 'png'
img.background_color = Color('white')
img.alpha_channel = 'remove'
image_filename = '{}.png'.format(i)
img.save(filename=path+image_filename)
I noted that the script outputs all files at the end of the process, rather than one by one, which I am guessing it might put unnecessary burden on memory, and ultimately cause a SEGFAULT or something similar.
Thanks for checking out my question, and for any hints.