I'm using this code to merge 94 individual 1-page .pdf documents into a book.

from glob import glob
from PyPDF2 import PdfMerger


def pdf_merge():
    """Merges individual .pdfs into one large .pdf"""
    merger = PdfMerger()
    allpdfs = [a for a in glob("*.pdf")]
    [merger.append(pdf) for pdf in allpdfs]
    with open("book.pdf", "wb") as new_file:
        merger.write(new_file)


if __name__ == "__main__":
    pdf_merge()

This script is placed in a directory with the individual "pages" of the final book file (94 individual .pdf files consisting of one page each). Each filename is formatted as page_X.pdf where "X" is the number of the page, beginning with 1 and ending with 94, i.e. "page_1" through "page_94".

Everything runs smoothly and I get a .pdf at the end called book.pdf. Most of the pages are in the correct order. Strangely, however, pages 2-9 are scattered at seemingly random intervals throughout: page 1 is correct, then pages 2-9 are missing, so the second page of the book is page 10. Everything after that is in the right order, except that pages 2-9 turn up at intervals along the way.

Thanks for your help.

Martin Thoma
upquark00
  • Are you sorting to ensure proper order is written out the way you need? You can split the filename and sort by `int(page_num)` – ViaTech Jan 14 '23 at 19:21
  • Hmm thank you. I think I don't understand the way it determines page order. Is there a way to order pages based on filename or is that automatic? And if so then why don't the files named page_1 ... through page_94 fall into place? – upquark00 Jan 14 '23 at 19:35
  • Ah I got what you meant. Sorry for the rudimentary question. It turns out it's ordering pages like this: ['page_1.pdf', 'page_10.pdf', 'page_11.pdf', 'page_12.pdf', 'page_13.pdf', 'page_14.pdf', 'page_15.pdf', 'page_16.pdf', 'page_17.pdf', 'page_18.pdf',...], which is odd in my opinion... 10 after 1 instead of 2. Ha. But now I can fix it. Thanks. – upquark00 Jan 14 '23 at 19:49
  • Check the order with `print(glob("*.pdf"))`. Maybe your question has nothing to do with PDFs and the fix could just be `sorted(glob("*.pdf"), key=lambda s: int(s.removesuffix('.pdf').partition('_')[-1]))` – cards Jan 14 '23 at 20:54
  • or `key = lambda s: int(re.findall("[0-9]+", s)[0])` that will find the numbers anywhere they are in the file name – Caridorc Jan 15 '23 at 12:31
  • You might be interested in https://pypi.org/project/natsort/ – Martin Thoma Jan 15 '23 at 15:22
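
Pulling the comments together, here is a minimal sketch of the numeric sort they describe, assuming the files really are named page_X.pdf as stated in the question. The helper names `page_number` and `sorted_pages` are made up for illustration and are not part of PyPDF2 or glob. The pattern `page_*.pdf` is also an assumption: it is narrower than the original `*.pdf` so that a previously written book.pdf is not picked up on a second run.

```python
import re
from glob import glob


def page_number(filename):
    """Return the first run of digits in the filename as an int (hypothetical helper)."""
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError(f"no page number found in {filename!r}")
    return int(match.group())


def sorted_pages(filenames):
    """Order filenames by numeric page number rather than by string comparison."""
    return sorted(filenames, key=page_number)


if __name__ == "__main__":
    # Assumes files named page_X.pdf sit in the current directory.
    for pdf in sorted_pages(glob("page_*.pdf")):
        print(pdf)  # in pdf_merge(), this is where merger.append(pdf) would go
```

Compared as strings, "page_10.pdf" sorts before "page_2.pdf" because the character "1" comes before "2"; converting the digits to an int makes 2 come before 10. In the script from the question, the list passed to `merger.append` would then be `sorted_pages(glob("page_*.pdf"))` instead of the unsorted `glob("*.pdf")` result.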

0 Answers