
I have to deal with a lot of files. How can I tell which ones are PDF files and which are not? I am running Python on Windows.

  • If you can't trust the file name and want to check the contents, see http://stackoverflow.com/q/10937350/ – Dan Getz Jul 24 '14 at 15:45

2 Answers


If you don't trust the file name's extension, you can read the first few bytes of the file and check whether they start with %PDF-

Like so:

with open(fn, 'rb') as fin:
    # The file is opened in binary mode, so compare against bytes.
    header = fin.read(20)
    if header.startswith(b'%PDF-'):
        # It is a PDF file.
        # The PDF version follows the marker as x.x, e.g. %PDF-1.4
        pass
    else:
        # It is not a PDF file.
        pass
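
As a minimal sketch of the version parsing mentioned in the comments above, the header check can be wrapped in a function. The name pdf_version and the regular expression are assumptions made for illustration, not part of the original answer, and the sketch only looks for the marker at the very start of the file.

```python
import re

# Hypothetical pattern: the %PDF- marker followed by a version such as 1.4.
_PDF_HEADER = re.compile(rb'%PDF-(\d+\.\d+)')


def pdf_version(path):
    """Return the version string from the PDF header, or None if absent."""
    with open(path, 'rb') as fin:
        header = fin.read(20)
    # match() anchors at the start of the bytes that were read.
    match = _PDF_HEADER.match(header)
    if match is None:
        return None
    return match.group(1).decode('ascii')
```

A caller can treat a None result as "not a PDF file" and any other result as the declared version. This does not validate the rest of the file, so a truncated or corrupt PDF would still pass.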
dawg

If you want to rely on the file extension, you can use the following code:

#!python3

import os

def isPDFfile(fname):
    name, ext = os.path.splitext(fname)
    return ext.lower() == '.pdf'

if __name__ == '__main__':
    for fname in os.listdir('.'):
        if isPDFfile(fname):
            print(fname, 'is a PDF file.')
        else:
            print(fname, 'is not a PDF file.')

If you want to be sure that the name is not a directory, you can add the test:

def isPDFfile(fname):
    if not os.path.isfile(fname):
        return False
    name, ext = os.path.splitext(fname)
    return ext.lower() == '.pdf'

There is also the os.walk() function, which walks a directory tree and yields the subdirectory and file names found in each directory. If you want to find all PDF files under a directory, including its subdirectories, you can write a specialized generator that yields only PDF files:

def walkPDFfiles(directory):
    for dirpath, dirs, files in os.walk(directory):
        for fname in files:
            name, ext = os.path.splitext(fname)
            if ext.lower() == '.pdf':
                yield os.path.join(dirpath, fname) 

You can then use it in a loop like this:

for fname in walkPDFfiles('.'):
    print(fname, 'is a PDF file.')
pepr
  • What if the question was about `.exe` files? – Jul 29 '14 at 07:31
  • @begueradj: To search for files with a different extension, the code must be modified. The extension(s) could be passed as an argument if desired. – pepr Jul 29 '14 at 07:48
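
One way the extension(s) might be passed as an argument, as suggested in the comment above, is sketched below. The function name walk_files_with_ext and its extensions parameter are hypothetical and chosen only for illustration. The matching is case-insensitive, as in the answer's code.

```python
import os


def walk_files_with_ext(directory, extensions=('.pdf',)):
    """Yield paths under directory whose extension is in extensions."""
    # Normalize once so that '.PDF' and '.pdf' are treated alike.
    wanted = {ext.lower() for ext in extensions}
    for dirpath, dirs, files in os.walk(directory):
        for fname in files:
            name, ext = os.path.splitext(fname)
            if ext.lower() in wanted:
                yield os.path.join(dirpath, fname)
```

For example, walk_files_with_ext('.', ('.exe',)) would yield only the .exe files, and passing ('.pdf', '.exe') would yield both kinds.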