Extract pdfs from a directory and output images to a different directory with pdf2image

Question

I'm trying to read some PDFs located in one directory and output images of their pages to a different directory.

(I'd like to understand how this code works, and I'm hoping there's a cleaner way to specify an output directory for my image files.)

What I've done works, but I think it is just bouncing back and forth between my save directory and my pdf directory.

This doesn't feel like a clean approach. Is there a better option, which preserves the existing code and accomplishes what my added lines do?

import os
from pdf2image import convert_from_path

pdf_dir = r"mydirectorypathwithPDFs"
save_dir = 'mydirectorypathforimages'

os.chdir(pdf_dir)

for pdf_file in os.listdir(pdf_dir):
    os.chdir(pdf_dir) #I added this, change back to the pdf directory
    if pdf_file.endswith(".pdf"):
        pages = convert_from_path(pdf_file, 300)
        pdf_file = pdf_file[:-4]
        for page in pages:
            os.chdir(save_dir) #I added this, change to the save directory
            page.save("%s-page%d.jpg" % (pdf_file,pages.index(page)), "JPEG")

The code I slightly modified was created by @photek1944 and found here: https://stackoverflow.com/a/53463015/10216912

I don't see a problem with this. What do you mean by not clean? If you prefer not to switch directories, then you can just save to the absolute path, or use relative paths, then you don't need any os commands — MaxYarmolinsky, Jan 01 '21 at 18:59
Hi, thanks Max, yes I think that's what I'd like is to just specify the input and output paths ahead of time — Chris Fetterly, Jan 01 '21 at 19:15
I don't understand why you use `chdir(pdf_dir)` and `listdir(pdf_dir)` together. If you use `chdir(pdf_dir)` then you can use `listdir()` without argument. Or if you use `listdir(pdf_dir)` then you don't need `chdir(pdf_dir)` — furas, Jan 01 '21 at 20:18
If you do not want to change the working directory back-and-forth over-and-over-again, then use absolute paths instead of relative paths. — Toothpick Anemone, Jan 01 '21 at 22:25
Thanks all, would someone be able to show me how to use absolute paths? I think that's my hang up, I just don't know how to do it effectively. — Chris Fetterly, Jan 01 '21 at 23:11

CrazyChucky · Accepted Answer · 2021-01-03T23:16:34.553

This might go a little beyond the scope of exactly what you asked, but anytime someone's looking to streamline code involving os for manipulating paths and files, I always like to recommend Python's pathlib module, because it is awesome. Here's how I personally would implement your program:

from pathlib import Path
from pdf2image import convert_from_path

# Use forward slashes here, even if you're on Windows.
pdf_dir = Path('my/directory/path/with/PDFs')
save_dir = Path('my/directory/path/for/images')

for pdf_file in pdf_dir.glob('*.pdf'):
    pages = convert_from_path(pdf_file, 300)
    for num, page in enumerate(pages, start=1):
        page.save(save_dir / f'{pdf_file.stem}-page{num}.jpg', 'JPEG')

pathlib automatically handles providing the right separator (\ on Windows and / mostly everywhere else), it lets you add onto paths with / as an operator, and it makes searching through a folder particularly convenient with the glob method. It also exposes properties like name (blah.pdf), stem (blah), and suffix (.pdf) to more easily access the parts of the path and file name.
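To make those properties concrete, here is a small sketch that needs no real files. It uses PurePosixPath so the printed separators are the same on every platform, and the directory and file names are just the placeholders from above, not real paths:

```python
from pathlib import PurePosixPath

# Placeholder paths; PurePosixPath never touches the file system.
pdf_file = PurePosixPath('my/directory/path/with/PDFs') / 'blah.pdf'
save_dir = PurePosixPath('my/directory/path/for/images')

print(pdf_file.name)    # blah.pdf
print(pdf_file.stem)    # blah
print(pdf_file.suffix)  # .pdf

# The / operator joins path segments, so no os.chdir is needed
# to point the output at a different directory.
target = save_dir / f'{pdf_file.stem}-page1.jpg'
print(target)  # my/directory/path/for/images/blah-page1.jpg
```

In the real program you would use Path rather than PurePosixPath, so the same code produces the native separator for whatever system it runs on.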

I'm also using an f-string for more readable formatting, and enumerate to track the page numbers. (I've set it to start at 1; I believe your original code would number the first page as 0.)
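The numbering difference can be seen with a plain list standing in for the pages (the strings below are hypothetical stand-ins, not real page images). Note that list.index returns the position of the first element that compares equal, so if two pages ever compared equal, the index-based approach would give both the same number, while enumerate simply counts:

```python
# Hypothetical stand-ins for the page images returned by convert_from_path.
pages = ['cover', 'blank', 'blank']

# Original approach: zero-based, and equal items share the first match's index.
by_index = [pages.index(page) for page in pages]

# enumerate counts positions directly; start=1 makes the first page number 1.
by_enumerate = [num for num, page in enumerate(pages, start=1)]

print(by_index)      # [0, 1, 1]
print(by_enumerate)  # [1, 2, 3]
```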
