How to distinguish a PDF file from other files?

Question

I have to deal with a lot of files. How can I tell which ones are PDF files and which are not? I am running Python on Windows. Thanks for any help.

If you can't trust the file name, and want to check the contents, see http://stackoverflow.com/q/10937350/ — Dan Getz, Jul 24 '14 at 15:45

dawg · Answer 1 · 2014-07-26T15:28:11.183

1

If you don't trust the file name extension, you can read the first few bytes of the file and test whether they start with %PDF-

Like so:

with open(fn, 'rb') as fin:
    header = fin.read(20)
    # The file is opened in binary mode, so compare against a bytes literal.
    if header.startswith(b'%PDF-'):
        # It's a PDF file; the version follows the marker, as in %PDF-x.x
        print(fn, 'is a PDF file.')
    else:
        print(fn, 'is not a PDF file.')
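
The header check can also be wrapped in reusable functions. The following is only a minimal sketch and not part of the original answer: the names `is_pdf_by_content` and `pdf_version` are hypothetical, and it assumes the `%PDF-` marker sits at the very first byte of the file, so files with leading junk bytes are reported as non-PDF.

```python
import re

PDF_MAGIC = b'%PDF-'

def is_pdf_by_content(path):
    """Return True if the file at `path` begins with the %PDF- marker."""
    try:
        with open(path, 'rb') as fin:
            return fin.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        # Directories and unreadable files are treated as non-PDF (an assumption).
        return False

def pdf_version(path):
    """Return the version string (for example '1.4'), or None if absent."""
    with open(path, 'rb') as fin:
        header = fin.read(20)
    match = re.match(rb'%PDF-(\d+\.\d+)', header)
    return match.group(1).decode('ascii') if match else None
```

Catching `OSError` keeps the function usable directly on the names returned by `os.listdir()`, which may include directories.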

edited Jul 26 '14 at 15:28

answered Jul 25 '14 at 18:25


score 1 · Accepted Answer · answered Jul 27 '14 at 09:55

If you want to rely on the file extension, you can use the following code:

#!python3

import os

def isPDFfile(fname):
    name, ext = os.path.splitext(fname)
    return ext.lower() == '.pdf'

if __name__ == '__main__':
    for fname in os.listdir('.'):
        if isPDFfile(fname):
            print(fname, 'is a PDF file.')
        else:
            print(fname, 'is not a PDF file.')

If you want to be sure that the name refers to a regular file and not a directory, you can add an os.path.isfile() test:

def isPDFfile(fname):
    if not os.path.isfile(fname):
        return False
    name, ext = os.path.splitext(fname)
    return ext.lower() == '.pdf'

There is also the os.walk() function, which iterates recursively through the files in a directory tree. If you want to find all PDF files inside a directory and its subdirectories, you can write your own specialized walk that yields only PDF files:

def walkPDFfiles(directory):
    for dirpath, dirs, files in os.walk(directory):
        for fname in files:
            name, ext = os.path.splitext(fname)
            if ext.lower() == '.pdf':
                yield os.path.join(dirpath, fname)

You can then use it in a loop like this:

for fname in walkPDFfiles('.'):
    print(fname, 'is a PDF file.')

@begueradj: When searching for the files with a different extension, the code must be modified. The extension(s) could be passed via an argument if wanted. — pepr, Jul 29 '14 at 07:48
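
Following the comment above, the extensions could be passed in as an argument. This is a minimal sketch of that idea, not the answerer's code: `walk_files_with_ext` and its `extensions` parameter are hypothetical names chosen for illustration.

```python
import os

def walk_files_with_ext(directory, extensions=('.pdf',)):
    """Yield paths under `directory` whose extension is in `extensions`."""
    # Normalize once so the comparison is case-insensitive.
    wanted = {ext.lower() for ext in extensions}
    for dirpath, dirs, files in os.walk(directory):
        for fname in files:
            name, ext = os.path.splitext(fname)
            if ext.lower() in wanted:
                yield os.path.join(dirpath, fname)

if __name__ == '__main__':
    # Example: list PDF and text files below the current directory.
    for path in walk_files_with_ext('.', ('.pdf', '.txt')):
        print(path)
```

With the default argument it behaves like walkPDFfiles(); passing a different tuple selects other file types without changing the function.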
