How to convert multipage PDF to list of image objects in Python?

Question

I'd like to turn a multipage PDF document into a list of image objects in Python, without saving the images to disk (I'd like to process them with PIL Image). So far I can only do this by writing the images to files first:

from wand.image import Image

with Image(filename='source.pdf') as img:
    with img.convert('png') as converted:
        converted.save(filename='pyout/page.png')

But how could I turn the img objects above directly into a list of PIL.Image objects?

score 8 · Answer 1 · edited Jun 20 '20 at 09:12

new answer:

pip install pdf2image

from pdf2image import convert_from_path, convert_from_bytes
images = convert_from_path('/path/to/my.pdf')

You may need to install Pillow as well. pdf2image relies on poppler, so it might only work out of the box on Linux.

https://github.com/Belval/pdf2image

Results may be different between the two methods.
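
The import above also pulls in convert_from_bytes, which the snippet never uses. As a minimal sketch of how it could cover the case where the PDF is already in memory (the helper names pdf_bytes_to_images and pdf_path_to_images are made up for illustration, and the dpi default of 200 is just pdf2image's own default):

```python
def pdf_bytes_to_images(pdf_bytes, dpi=200):
    """Render every page of an in-memory PDF into a PIL image.

    Hypothetical helper, not part of pdf2image itself.
    """
    # Imported inside the function so that defining the helper
    # does not fail on machines where pdf2image is not installed.
    from pdf2image import convert_from_bytes

    # convert_from_bytes returns a list with one PIL image per page.
    return convert_from_bytes(pdf_bytes, dpi=dpi)


def pdf_path_to_images(pdf_path, dpi=200):
    """Read a PDF file into memory, then render it page by page."""
    with open(pdf_path, "rb") as handle:
        return pdf_bytes_to_images(handle.read(), dpi=dpi)
```

Either helper hands back an ordinary Python list, so the pages can be indexed, sliced, or passed straight to other PIL code without touching the disk again.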

old answer:

Python 3.4:

from PIL import Image
from wand.image import Image as wimage
import os
import io

if __name__ == "__main__":
    filepath = "fill this in"
    assert os.path.exists(filepath)
    page_images = []
    with wimage(filename=filepath, resolution=200) as img:
        for page_wand_image_seq in img.sequence:
            page_wand_image = wimage(page_wand_image_seq)
            page_jpeg_bytes = page_wand_image.make_blob(format="jpeg")
            page_jpeg_data = io.BytesIO(page_jpeg_bytes)
            page_image = Image.open(page_jpeg_data)
            page_images.append(page_image)

Lastly, you can make a system call to mogrify, but that can be more complicated as you need to manage temporary files.

I've included an edit suggested by @jtlz2 that I can't accept because it's already been rejected. Basically making Image point to PIL.Image by default, rather than wand.image.Image which I think is very rare to use. — Bryant Kou, Jul 26 '18 at 17:18

yeachan park · Answer 2 · 2020-04-22T01:49:49.643

A simple way is to save the image files, read them back with PIL, and then delete them.

I recommend the pdf2image package. Before using it, you might need to install the poppler package via Anaconda:

conda install -c conda-forge poppler

If you are stuck, update conda before installing:

conda update conda
conda update anaconda

After installing poppler, install pdf2image via pip:

pip install pdf2image

Then run this code:

from pdf2image import convert_from_path

dpi = 500  # dots per inch
pdf_file = 'work.pdf'
pages = convert_from_path(pdf_file, dpi)
for i, page in enumerate(pages):
    page.save('output_{}.jpg'.format(i), 'JPEG')

After this, read the files back with PIL and delete them.
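
Note that convert_from_path already returns PIL images, so the pages list can often be used as it is. If going through disk is still preferred, the save, read, and delete steps could be handed to a temporary directory, roughly as sketched below. The helper name pdf_to_images_via_tempdir is hypothetical; output_folder is a pdf2image parameter that makes it write the rendered pages into the given folder.

```python
import tempfile


def pdf_to_images_via_tempdir(pdf_path, dpi=200):
    """Render a PDF through a temporary folder and return in-memory copies.

    Hypothetical helper, not part of pdf2image itself.
    """
    # Imported inside the function so that defining the helper
    # does not fail on machines where pdf2image is not installed.
    from pdf2image import convert_from_path

    with tempfile.TemporaryDirectory() as tmp_dir:
        # The rendered pages are written into tmp_dir.
        pages = convert_from_path(pdf_path, dpi=dpi, output_folder=tmp_dir)
        # copy() loads the pixel data into memory, so the images stay
        # usable after the temporary directory has been removed.
        return [page.copy() for page in pages]
```

The temporary directory and everything in it are deleted when the with block exits, so no manual cleanup of output files is needed.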

score 1 · Answer 3 · answered Oct 21 '20 at 17:59

My answer with wand is the following:

from tkinter import filedialog  # provides askopenfilename

from wand.image import Image as wi
...
# Ask the user for a PDF through a file dialog.
Data = filedialog.askopenfilename(initialdir="/", title="Choose File", filetypes=(("Portable Document Format", "*.pdf"), ("All Files", "*.*")))
apps.append(Data)  # apps is assumed to be a list defined in the elided code above
print(Data)
PDFfile = wi(filename=Data, resolution=300)
Images = PDFfile.convert('tiff')
ImageSequence = 1
# Save each page as its own TIFF file.
for img in PDFfile.sequence:
    image = wi(image=img)
    image.save(filename="Document_300" + "_" + str(ImageSequence) + ".tiff")
    ImageSequence += 1

Hopefully this will help you.

I've implemented it with a GUI where you can simply choose your file.

You can also pass a different format, such as jpg, to PDFfile.convert().

score -1 · Answer 4 · answered Oct 20 '21 at 19:48

Download Poppler from https://blog.alivate.com.au/poppler-windows/, then use the following code:

from pdf2image import convert_from_path

file_name = 'A019'
images = convert_from_path(r'D:\{}.pdf'.format(file_name), poppler_path=r'C:\poppler-0.68.0\bin')

for i, im in enumerate(images):
    im.save(r'D:\{}-{}.jpg'.format(file_name,i))

If you get an error because of poppler's path, add poppler's bin directory to the "Path" variable in the Windows environment variables. The path can look like this: "C:\poppler-0.68.0\bin"
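
Before editing environment variables, it may help to check whether poppler's pdftoppm executable (one of the tools pdf2image calls) can be found at all. The following is a small sketch using only the standard library; find_poppler_dir is a hypothetical helper, and the candidate directory is simply the example path from this answer.

```python
import shutil


def find_poppler_dir(candidate_dirs, tool="pdftoppm"):
    """Return the first directory in candidate_dirs that holds `tool`, else None.

    Hypothetical helper, not part of pdf2image.
    """
    for directory in candidate_dirs:
        # With `path` given, shutil.which searches that directory
        # instead of the PATH environment variable.
        if shutil.which(tool, path=str(directory)) is not None:
            return str(directory)
    return None


if __name__ == "__main__":
    if shutil.which("pdftoppm") is not None:
        print("pdftoppm is on PATH; poppler_path can be omitted")
    else:
        # Example location taken from the answer above; adjust to your install.
        found = find_poppler_dir([r"C:\poppler-0.68.0\bin"])
        print("directory to pass as poppler_path:", found)
```

If the helper returns a directory, that string can be passed as poppler_path to convert_from_path; if it returns None, poppler is probably not installed in any of the listed locations.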
