
I tried to write a script that loops through a parent folder and its subfolders and merges all of the PDFs into one. Below is the code I have written so far, but I don't know how to combine it into one script.

Reference: Merge PDF files

The first function is meant to loop through all of the subfolders under the parent folder and get a list of paths, one for each PDF.

import os
from PyPDF2 import PdfFileMerger

root = r"folder path"
path = os.path.join(root, "folder path")

def list_dir():
    for path,subdirs,files in os.walk(root):
        for name in files:
            if name.endswith(".pdf") or name.endswith(".ipynb"):
                print (os.path.join(path,name))

            
            

Second, I created a list to hold the paths of all the PDF files in the subfolders and tried to merge them into one combined file. At this step, I got this error:

TypeError: listdir: path should be string, bytes, os.PathLike or None, not list

root_folder = []
root_folder.append(list_dir())
    
def pdf_merge():
    
    merger = PdfFileMerger()    
    allpdfs = [a for a in os.listdir(root_folder)]

    
    for pdf in allpdfs:
        merger.append(open(pdf,'rb'))
        
    with open("Combined.pdf","wb") as new_file:
        merger.write(new_file)

pdf_merge()

Where and how should I modify the code to avoid the error and combine the two functions?

Martin Thoma
Brian C.
  • I don't know why you need both scripts - simply run `merger.append(open(pdf,'rb'))` directly in the first script in place of `print()` – furas Jul 22 '21 at 05:15
  • you have a few mistakes - first, `list_dir()` only displays the paths, but you have to create a list of paths and use `return` to send it back. After that you only need `root_folder = list_dir()` or directly `all_pdfs = list_dir()` – furas Jul 22 '21 at 05:16
  • @furas Thanks! But I was thinking the first function "list_dir" could be reused in other scripts whenever I need to loop through folders. – Brian C. Jul 22 '21 at 12:24

1 Answer


First, you need a function that builds a list of all the matching files and returns it.

def list_dir(root):
    result = []
    
    for path, dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith( (".pdf", ".ipynb") ):
                result.append(os.path.join(path, name))
                
    return result

I also use .lower() to catch extensions like .PDF.

endswith() accepts a tuple, so one call can check every extension.

It is better to pass external values in as arguments: list_dir(root) instead of list_dir(), which relied on a global variable.
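A small illustration of the matching rule described above. The file names and the `is_wanted` helper are made up for this sketch and are not part of the original code:

```python
# Hypothetical helper and sample names, only to illustrate the matching rule.
WANTED = (".pdf", ".ipynb")  # endswith() accepts a tuple of suffixes


def is_wanted(name, exts=WANTED):
    """Return True if name ends with one of exts, ignoring case."""
    return name.lower().endswith(exts)


names = ["report.pdf", "SCAN.PDF", "notes.ipynb", "photo.png"]
matches = [n for n in names if is_wanted(n)]
print(matches)
```

Without `.lower()`, a name such as `SCAN.PDF` would be skipped, because `endswith()` compares case-sensitively.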


Then you can call it as

allpdfs = list_dir("folder path")

inside the merge function:

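def pdf_merge(root):
    merger = PdfFileMerger()
    # list_dir() also returns .ipynb files, so keep only the PDFs
    allpdfs = [p for p in list_dir(root) if p.lower().endswith(".pdf")]

    for pdf in allpdfs:
        merger.append(pdf)  # append() accepts a path and opens the file itself

    with open("Combined.pdf", 'wb') as new_file:
        merger.write(new_file)

    merger.close()  # release the input files

pdf_merge("folder path")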
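For context, here is a minimal sketch of why the code in the question raised the TypeError. `print_only` is a hypothetical stand-in for the original `list_dir()`, which printed paths but had no `return`:

```python
import os


def print_only():
    """Mimics the original list_dir(): prints a path but returns nothing."""
    print("some/path/file.pdf")


root_folder = []
root_folder.append(print_only())  # appends None, because nothing is returned
print(root_folder)

error_message = None
try:
    os.listdir(root_folder)       # a list is not a valid path argument
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

So the list never contained any paths, and `os.listdir()` was never the right call anyway, since the function is already supposed to hand back full file paths.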

EDIT:

The first function could be more general if it also took the extensions as an argument:

import os

def list_dir(root, exts=None):
    result = []
    
    for path, dirs, files in os.walk(root):
        for name in files:
            if exts and not name.lower().endswith(exts):
                continue

            result.append(os.path.join(path, name))
                
    return result

all_files  = list_dir('folder_path')
all_pdfs   = list_dir('folder_path', '.pdf')
all_images = list_dir('folder_path', ('.png', '.jpg', '.gif'))

print(all_files)
print(all_pdfs)
print(all_images)
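One thing to watch with `exts`: `str.endswith()` takes a string or a tuple of strings, but not a list, and the comparison above is against `name.lower()`, so uppercase extensions passed in would never match. A possible guard, assuming you want to accept lists too (`normalize_exts` is a hypothetical helper, not part of the code above):

```python
def normalize_exts(exts):
    """Hypothetical helper: turn None, a string, or any iterable of strings
    into something str.endswith() accepts, lowercased to match name.lower()."""
    if exts is None:
        return None
    if isinstance(exts, str):
        return exts.lower()
    return tuple(e.lower() for e in exts)


list_rejected = False
try:
    "a.pdf".endswith([".pdf"])  # a list is rejected by endswith()
except TypeError:
    list_rejected = True

print(list_rejected)
print(normalize_exts([".PDF", ".Png"]))
print("a.pdf".endswith(normalize_exts([".PDF"])))
```

Calling `exts = normalize_exts(exts)` at the top of `list_dir()` would be one way to wire it in.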

EDIT:

For a single extension you can also use glob:

import glob

all_pdfs = glob.glob('folder_path/**/*.pdf', recursive=True)

It needs ** with recursive=True to search in subfolders.
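To see the difference `recursive=True` makes, here is a sketch that builds a throwaway folder tree (the file names are made up) and runs the same pattern both ways:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Throwaway layout: one empty "PDF" at each depth.
    os.makedirs(os.path.join(root, "sub", "deep"))
    for rel in ("a.pdf",
                os.path.join("sub", "b.pdf"),
                os.path.join("sub", "deep", "c.pdf")):
        open(os.path.join(root, rel), "wb").close()

    pattern = os.path.join(root, "**", "*.pdf")
    recursive = sorted(os.path.relpath(p, root)
                       for p in glob.glob(pattern, recursive=True))
    flat = sorted(os.path.relpath(p, root)
                  for p in glob.glob(pattern))

print(recursive)  # files at every depth
print(flat)       # only the file exactly one folder below root
```

Without `recursive=True`, `**` behaves like a plain `*` and matches exactly one folder level. Note also that on case-sensitive file systems `*.pdf` will not match `.PDF`, which the `list_dir()` version handles through `.lower()`.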

furas
  • Hi furas, thanks a lot! This code works, and I also appreciate the **EDIT** version of the code. But I have a question: it seems to work only with my original folder path. If I change `'folder_path'` to another folder, it returns an empty list. Could you let me know what I did wrong? – Brian C. Jul 22 '21 at 13:00
  • if it is a `relative path` then it may not be found, and you may have to use `/full/path/to/folder`. – furas Jul 22 '21 at 13:39
  • `all_pdfs = list_dir('H:\Brian\GIS', '.pdf')` This works. `all_pdfs = list_dir('H:\Brian\NAK Progress Report', '.pdf')` This one shows a unicode error. `H:\Brian\Python Script\test_tbdeleted1` This one returns an empty result, but it does have PDFs inside. – Brian C. Jul 22 '21 at 13:44
  • the char `\` has a special meaning in strings - even in paths - so you should use `H:\\Brian\\GIS` or the prefix `r` for a raw string: `r'H:\Brian\GIS'`. – furas Jul 22 '21 at 13:46
  • and `\t` means tab - even in `\test_tbdeleted1` – furas Jul 22 '21 at 13:49
  • and it seems `\N` is a special escape for some `unicode` characters, but I don't know the details - I have never run into it before. – furas Jul 22 '21 at 13:52
  • `print('\N{grinning face} \N{Nerd Face}')` gives me emoji – furas Jul 22 '21 at 14:04
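The escape problems discussed in these comments can be reproduced with a short sketch. The `C:\temp\new` and `H:\NAK` paths are made-up examples, chosen so that every backslash pair is either a valid escape or the one being demonstrated:

```python
broken = 'C:\temp\new'     # \t becomes a tab and \n a newline
raw = r'C:\temp\new'       # raw string: backslashes are kept as typed
doubled = 'C:\\temp\\new'  # each \\ is one literal backslash

print(repr(broken))
print(raw == doubled)

# \N must be followed by {NAME}; it then inserts that Unicode character.
print('\N{BULLET}' == '\u2022')

# A bare \N, as in a folder name starting with N, does not even compile.
bare_n_error = None
try:
    compile(r"p = 'H:\NAK'", "<demo>", "exec")
except SyntaxError as exc:
    bare_n_error = str(exc)
    print(bare_n_error)
```

This is why one path worked, one gave a unicode error (`\N`), and one silently pointed at a folder that does not exist (`\t`). Raw strings or forward slashes avoid all three cases.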