3

The code I am working with takes in a .pdf file and outputs a .txt file. How do I create a loop (probably a for loop) that runs the code on every file in a folder whose name ends in ".pdf"? Furthermore, how do I change the output on each iteration so that a new file is written each time, with the same name as the input file (i.e. 1_pet.pdf > 1_pet.txt, 2_pet.pdf > 2_pet.txt, etc.)?

Here is the code so far:

path="2_pet.pdf"
content = getPDFContent(path)
encoded = content.encode("utf-8")
text_file = open("Output.txt", "w")
text_file.write(encoded)
text_file.close()
Geeocode
Jack Bunce
  • possible duplicate of [Find all files in directory with extension .txt with python](http://stackoverflow.com/questions/3964681/find-all-files-in-directory-with-extension-txt-with-python) – Red Shift Jul 21 '15 at 17:48

4 Answers

5

The following script solves your problem:

import os

sourcedir = 'pdfdir'

for f in os.listdir(sourcedir):
    # splitext copes with names that contain several dots or none at all
    root, ext = os.path.splitext(f)
    if ext == ".pdf":
        # join with the directory name, not with the list of files
        path_in = os.path.join(sourcedir, f)
        content = getPDFContent(path_in)
        encoded = content.encode("utf-8")
        path_out = os.path.join(sourcedir, root + ".txt")
        with open(path_out, 'w') as text_file:
            text_file.write(encoded)
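
On deriving the output name: splitting on "." and taking the first two pieces only works when the file name contains exactly one dot. Here is a minimal sketch of the same step with os.path.splitext. The helper name `txt_name_for` is made up for illustration, and the case-insensitive extension check is an extra assumption, not something the question asked for.

```python
import os.path

def txt_name_for(pdf_name):
    """Return the matching .txt name for a .pdf name, or None if it is not a PDF."""
    # splitext splits at the last dot only, so "report.v2.pdf" keeps "report.v2"
    root, ext = os.path.splitext(pdf_name)
    if ext.lower() != ".pdf":
        return None
    return root + ".txt"

print(txt_name_for("1_pet.pdf"))
print(txt_name_for("report.v2.pdf"))
print(txt_name_for("README"))
```

Names without a .pdf extension come back as None, so the caller can skip them.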
Geeocode
1

One way to operate on all PDF files in a directory is to invoke glob.glob() and iterate over the results:

import glob
for path in glob.glob('*.pdf'):
    content = getPDFContent(path)
    encoded = content.encode("utf-8")
    text_file = open("Output.txt", "w")
    text_file.write(encoded)
    text_file.close()

Another way is to allow the user to specify the files:

import sys
for path in sys.argv[1:]:
    ...

Then the user runs your script like python foo.py *.pdf.
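
Whichever way the paths are collected, the loop needs a distinct output name for each input, otherwise every iteration overwrites Output.txt. Here is a rough sketch, assuming Python 3 and an extraction function that returns a str. `convert_all` and `extract_text` are hypothetical names; you would pass the question's getPDFContent as `extract_text`.

```python
import os.path

def convert_all(paths, extract_text):
    """Write one .txt file next to each given PDF and return the output paths."""
    written = []
    for path in paths:
        # 1_pet.pdf -> 1_pet.txt, in the same folder as the input
        root, _ = os.path.splitext(path)
        out_path = root + ".txt"
        text = extract_text(path)
        # Let open() do the UTF-8 encoding instead of calling .encode() by hand
        with open(out_path, "w", encoding="utf-8") as text_file:
            text_file.write(text)
        written.append(out_path)
    return written
```

With the command-line variant this would be called as `convert_all(sys.argv[1:], getPDFContent)`, and with glob as `convert_all(glob.glob('*.pdf'), getPDFContent)`.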

Robᵩ
  • I just added this to my code, and it ran without returning any errors, however my output file only relates to my first pdf file. Is there a reason why it might not be running past the first file? Also, how do I go about changing the output during each iteration of the for loop to mirror the file name of the pdf file? – Jack Bunce Jul 21 '15 at 18:05
1

Create a function that encapsulates what you want to do to each file.

import os.path

def parse_pdf(filename):
    "Parse a pdf into text"
    content = getPDFContent(filename)
    encoded = content.encode("utf-8")
    # split off the .pdf extension so .txt can be added instead
    (root, _) = os.path.splitext(filename)
    text_file = open(root + ".txt", "w")
    text_file.write(encoded)
    text_file.close()

Then apply this function to a list of filenames, like so:

for f in files:
    parse_pdf(f)
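
The `files` list in the loop is not defined in the snippet. One possible way to build it from a directory is with glob. `pdf_files_in` is a hypothetical helper name, and it only looks at the top level of the folder, not subfolders.

```python
import glob
import os.path

def pdf_files_in(directory):
    """Return the sorted paths of all .pdf files directly inside `directory`."""
    # The pattern carries the directory, so each result is a usable path
    pattern = os.path.join(directory, "*.pdf")
    return sorted(glob.glob(pattern))
```

The loop then becomes `for f in pdf_files_in("pdfdir"): parse_pdf(f)`, where "pdfdir" stands in for your own folder.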
ajerneck
  • This looks like it will work! The problem is that I need files to refer to my directory. Would I do it like this? `files = "Users/Jack/Downloads/pyPdf-1.13"` – Jack Bunce Jul 21 '15 at 18:16
  • You can get the files from a directory using glob, like in Rob's answer – ajerneck Jul 21 '15 at 18:19
  • That helped, and actually worked (kind of). I now am having the issue that the named text files are being returned, but they are blank, and when I try to do a few hundred files, I get the error `pyPdf.utils.PdfReadError: EOF marker not found`. Do you have any idea why either of these are happening? I really appreciate your help! – Jack Bunce Jul 21 '15 at 18:26
  • check that `content.encode` is not returning `None`: I think it might change the encoding "in place". Try adding `print encoded` for example, to see if it is what you expect. – ajerneck Jul 21 '15 at 18:29
  • This answer doesn't cover every aspect of the OP's question, i.e. "on all files in a folder". – Geeocode Jul 21 '15 at 19:45
0

You could use a recursive function to search the folder and all of its subfolders for files that end in .pdf, then create a text file for each of them.

It could be something like:

import os

def convert_PDF(path, func):
    d = os.path.basename(path)
    if os.path.isdir(path):
        # recurse into every entry of the directory
        for x in os.listdir(path):
            convert_PDF(os.path.join(path, x), func)
    elif d[-4:] == '.pdf':
        func(path)

# based entirely on your example code
def convert_to_txt(path):
    content = getPDFContent(path)
    encoded = content.encode("utf-8")
    file_path = os.path.dirname(path)
    # replace pdf with txt extension
    file_name = os.path.basename(path)[:-4]+'.txt'
    # os.path.join also works when file_path is empty (a bare file name)
    with open(os.path.join(file_path, file_name), "w") as text_file:
        text_file.write(encoded)

convert_PDF('path/to/files', convert_to_txt)

Because the operation is passed in as a function, you can swap it for whatever you need to perform (using a different library, converting to a different type, etc.).
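
The same traversal can also be written without explicit recursion, using os.walk from the standard library. This is only a sketch: `find_pdfs` is a made-up name, and the case-insensitive match on the extension is an assumption beyond the original code.

```python
import os

def find_pdfs(top):
    """Yield the path of every .pdf file under `top`, including subfolders."""
    # os.walk visits `top` and each subfolder, listing the files in each one
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)
```

It would be used as `for path in find_pdfs('path/to/files'): convert_to_txt(path)`.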

DFenstermacher