3

I have a directory with many files that are named like:

1234_part1.pdf
1234.pdf
5432_part1.pdf
5432.pdf
2323_part1.pdf
2323.pdf
etc.

I am trying to merge the PDFs whose leading number is the same. I have code that can do this one pair at a time, but with over 500 files in the directory I am not sure how to loop through them. Here is what I have so far:

from PyPDF2 import PdfFileMerger, PdfFileReader
merger = PdfFileMerger()
merger.append(PdfFileReader(open('c:/example/1234_part1.pdf', 'rb')))
merger.append(PdfFileReader(open('c:/example/1234.pdf', 'rb')))
merger.write("c:/example/ouput/1234_combined.pdf")

Ideally the output file would be 'xxxx_combined_<today's date>.pdf', e.g. 1234_combined_051719.pdf

Also, if a number has only one of the two files, nothing should be combined: e.g. if there is a 9999_part1.pdf but no 9999.pdf, there should be no '9999_combined_<today's date>.pdf' output.

martineau
Lulumocha
  • In order to provide the best answer, I need to know some more things. 1) Can we assume the numbers in the file name are always the same number of digits? 2) Will there only ever be 1 or 2 files with matching numbers, or could there be more? – Owen May 17 '19 at 19:41
  • Thanks, the numbers will always be 4 digits and be in the beginning of the file. There should always be 1 or 2 files with the same number – Lulumocha May 17 '19 at 19:46
  • Hi @Lulumocha, is your question solved? If so, feel free to mark an answer as accepted. – Amir Shabani May 22 '19 at 00:31

2 Answers

2

Try using os.listdir() to get all of the files in your directory. Then use .split() on each filename to isolate the PDF file number. Then look for that number in the list of files you made.

import os
from PyPDF2 import PdfFileMerger, PdfFileReader

dir = 'my/dir/of/pdfs/'
file_list = os.listdir(dir)
num_list = []

for fname in file_list:
    if '_' in fname:  # if the filename has an underscore in it
        file_num = fname.split('_')[0]  # gets the first element in the list of splits
    else:
        file_num = fname.split('.')[0]
    if file_num not in num_list:
        num_list.append(file_num)

# now you have a list of all of your file numbers you can grab all files
# in the file_list containing that number
for num in num_list:
    pdf_parts = [x for x in file_list if num in x] # grabs all files with that number
    if len(pdf_parts) < 2:  # if there is only one pdf with that num ...
        continue  # skip it!
    # your pdf append operation here for each item in the pdf_parts list.
    # something like this maybe ...

    merger = PdfFileMerger()
    # sorts list by filename length in descending order so that
    # '_part' files come first
    sorted_pdf_parts = sorted(pdf_parts, key=len, reverse=True)
    for part in sorted_pdf_parts:
        merger.append(PdfFileReader(open(dir + part, 'rb')))
    merger.write('out/dir/' + num + '_combined.pdf')
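
As an aside, the grouping step can also be done in one pass with a dictionary keyed on the leading number. This is only a minimal sketch, not the code above: `group_pdfs` and `mergeable` are hypothetical helper names, and it assumes every file name starts with its number followed by either `_` or `.`, as described in the question. It works on plain file names, so it can be tried without PyPDF2.

```python
from collections import defaultdict


def group_pdfs(file_names):
    """Map each leading number to the PDF names that start with it."""
    groups = defaultdict(list)
    for name in file_names:
        if not name.lower().endswith('.pdf'):
            continue  # ignore anything that is not a PDF
        stem = name[:-4]             # drop the '.pdf' extension
        number = stem.split('_')[0]  # '1234_part1' -> '1234', '1234' -> '1234'
        groups[number].append(name)
    return dict(groups)


def mergeable(groups):
    """Keep numbers with at least two files, longest names ('_part' files) first."""
    return {
        number: sorted(names, key=len, reverse=True)
        for number, names in groups.items()
        if len(names) >= 2
    }


# Example file names (assumed, modelled on the question)
names = ['1234_part1.pdf', '1234.pdf', '9999_part1.pdf', '5432.pdf', '5432_part1.pdf']
pairs = mergeable(group_pdfs(names))
print(pairs)
```

Each value in `pairs` is already in append order, so the merge loop could iterate over `pairs.items()` directly. Writing the combined files to a separate output directory keeps them from being picked up as inputs on a later run.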

Jello
  • This is great, works perfectly. Thank you. I added the datetime.datetime.now() into the merger.write and can get everything I need now. – Lulumocha May 17 '19 at 20:38
  • It [can](https://stackoverflow.com/a/56193616/6282576) be done with only one `for loop` instead of three, which I believe is faster and easier to understand. – Amir Shabani May 17 '19 at 21:18
0

You can do it like this:

from PyPDF2 import PdfFileMerger, PdfFileReader
from os import listdir, path
from datetime import datetime

# raw string so the backslashes are kept literally
pdf_dir = r'D:\Code\python-examples\PDF'
file_names = listdir(pdf_dir)

for file_name in file_names:
    if "_" in file_name:
        digits = file_name.split('_')[0]
        # merge only when the matching 'xxxx.pdf' also exists
        if f'{digits}.pdf' in file_names:
            digit_path = path.join(pdf_dir, f'{digits}.pdf')
            part1_path = path.join(pdf_dir, f'{digits}_part1.pdf')
            with open(digit_path, 'rb') as digit_file, open(part1_path, 'rb') as part1_file:
                merger = PdfFileMerger()
                merger.append(PdfFileReader(part1_file))
                merger.append(PdfFileReader(digit_file))
                merger.write(path.join(pdf_dir, f'{digits}_combined_{datetime.now().strftime("%m%d%y")}.pdf'))

A couple of notes:

  • It's recommended to use a with statement when opening files, so that they are closed automatically.
  • You can use datetime.now().strftime("%m%d%y") to get the date format you mentioned.
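
To illustrate the second note, here is a small sketch of how the output name could be built. `combined_name` is a hypothetical helper, not part of PyPDF2; it takes the date as a parameter so that the result is easy to check against the example in the question.

```python
from datetime import date


def combined_name(digits, day=None):
    """Build 'xxxx_combined_MMDDYY.pdf'; defaults to today's date."""
    if day is None:
        day = date.today()
    return f'{digits}_combined_{day.strftime("%m%d%y")}.pdf'


print(combined_name('1234', date(2019, 5, 17)))  # 1234_combined_051719.pdf
```

Calling `combined_name('1234')` without a date uses the current day, which is what the loop above does with `datetime.now()`.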

So if we have a folder like this:

[image: initial folder]

After we run the code, we have:

[image: done]

And we can see that it works:

[image: works]

I also uploaded the code, along with relevant files, to my GitHub page. If anyone wants to try it themselves, they can check it out.

Amir Shabani