PyPDF2 placing pages 2-9 at random locations in merged .pdf; everything else in the correct order

Question

I'm using this code to merge 94 individual 1-page .pdf documents into a book.

from glob import glob
from PyPDF2 import PdfMerger


def pdf_merge():
    """Merges individual .pdfs into one large .pdf"""
    merger = PdfMerger()
    allpdfs = glob("*.pdf")
    for pdf in allpdfs:
        merger.append(pdf)
    with open("book.pdf", "wb") as new_file:
        merger.write(new_file)


if __name__ == "__main__":
    pdf_merge()

This script is placed in a directory with the individual "pages" of the final book file (94 individual .pdf files consisting of one page each). Each filename is formatted as page_X.pdf where "X" is the number of the page, beginning with 1 and ending with 94, i.e. "page_1" through "page_94".

Everything runs smoothly and I get a .pdf at the end called book.pdf. Most of the pages are in the correct order. Strangely, however, pages 2-9 are scattered at seemingly random intervals throughout: page 1 is correct, then 2-9 are missing, so the second page of the book is page 10. From there everything is in the right order, except that pages 2-9 turn up one at a time further along.

Thanks for your help.

Are you sorting to ensure proper order is written out the way you need? You can split the filename and sort by `int(page_num)` — ViaTech, Jan 14 '23 at 19:21
Hmm thank you. I think I don't understand the way it determines page order. Is there a way to order pages based on filename or is that automatic? And if so then why don't the files named page_1 ... through page_94 fall into place? — upquark00, Jan 14 '23 at 19:35
Ah I got what you meant. Sorry for the rudimentary question. It turns out it's ordering pages like this: ['page_1.pdf', 'page_10.pdf', 'page_11.pdf', 'page_12.pdf', 'page_13.pdf', 'page_14.pdf', 'page_15.pdf', 'page_16.pdf', 'page_17.pdf', 'page_18.pdf',...], which is odd in my opinion... 10 after 1 instead of 2. Ha. But now I can fix it. Thanks. — upquark00, Jan 14 '23 at 19:49
check the order with `print(glob("*.pdf"))`, maybe your question has nothing to do with pdfs and could just be `sorted(glob("*.pdf"), key=lambda s: int(s.removesuffix('.pdf').partition('_')[-1]) )` — cards, Jan 14 '23 at 20:54
or `key = lambda s: int(re.findall("[0-9]+", s)[0])` that will find the numbers anywhere they are in the file name — Caridorc, Jan 15 '23 at 12:31
You might be interested in https://pypi.org/project/natsort/ — Martin Thoma, Jan 15 '23 at 15:22
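
The ordering reported in the comments above can be reproduced without any PDFs: file names compare as strings, character by character, so "page_10.pdf" sorts before "page_2.pdf". The sketch below shows one possible way to sort on the number instead. The helper name `page_number` and the sample list are made up for illustration, and the helper assumes the first run of digits in each file name is the page number, as in page_X.pdf.

```python
import re


def page_number(filename):
    """Return the first run of digits in a name like 'page_12.pdf' as an int.

    Hypothetical helper, not part of PyPDF2 or the original script.
    """
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError(f"no page number found in {filename!r}")
    return int(match.group())


# Sample names standing in for what glob might return.
names = ["page_1.pdf", "page_10.pdf", "page_11.pdf", "page_2.pdf", "page_9.pdf"]

# Plain string sorting compares character by character, so "10" lands before "2".
print(sorted(names))

# Sorting on the extracted integer puts the pages in numeric order.
print(sorted(names, key=page_number))
```

In the question's script this would probably amount to building the list as `sorted(glob("page_*.pdf"), key=page_number)` before appending. The narrower pattern is a suggestion so that a previously written book.pdf is not picked up on a second run; that name has no digits, so the helper would raise on it. glob itself makes no promise about the order of its results, so an explicit sort is needed either way.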
