I have to deal with a lot of files. How can I tell which ones are PDF files and which are not? I am running Python on Windows. Thanks for your help.
If you can't trust the file name, and want to check the contents, see http://stackoverflow.com/q/10937350/ – Dan Getz Jul 24 '14 at 15:45
2 Answers
If you don't trust the extension of the file name, you can read the first few bytes of the file and test whether they start with %PDF-
Like so:
with open(fn, 'rb') as fin:
    header = fin.read(20)
# The file was opened in binary mode, so compare against a bytes literal
if header.startswith(b'%PDF-'):
    # it's a PDF file...
    # you can parse the PDF version from the x.x that follows, as in %PDF-x.x
    pass
else:
    # it is not a PDF file
    pass
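As a minimal sketch of the same header check wrapped into reusable functions: the names `looks_like_pdf` and `pdf_version` are hypothetical, and the version parsing assumes the header has the usual `%PDF-x.y` form. It only inspects the first bytes, so it does not prove the rest of the file is a valid PDF.

```python
import os
import re

PDF_MAGIC = b"%PDF-"  # marker expected at the very start of a PDF file


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the %PDF- marker."""
    with open(path, "rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC


def pdf_version(path):
    """Return the version after %PDF- (e.g. '1.4'), or None if the
    header does not have the form %PDF-x.y."""
    with open(path, "rb") as f:
        header = f.read(20)
    match = re.match(rb"%PDF-(\d+\.\d+)", header)
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    # Check every regular file in the current directory (skips subdirectories).
    for fname in os.listdir("."):
        if os.path.isfile(fname):
            print(fname, looks_like_pdf(fname), pdf_version(fname))
```

Reading only a handful of bytes keeps the check cheap even when there are many large files.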

dawg
If you want to rely on the file extension, you can use the following code:
#!python3
import os

def isPDFfile(fname):
    name, ext = os.path.splitext(fname)
    return ext.lower() == '.pdf'

if __name__ == '__main__':
    for fname in os.listdir('.'):
        if isPDFfile(fname):
            print(fname, 'is PDF file.')
        else:
            print(fname, 'is not PDF file.')
If you want to be sure that the name is not a directory, you can add the test:
def isPDFfile(fname):
    if not os.path.isfile(fname):
        return False
    name, ext = os.path.splitext(fname)
    return ext.lower() == '.pdf'
There is also the os.walk() function, which iterates through a directory and all of its subdirectories. If you want to find all PDF files inside a directory tree, you can write your own specialized walk that returns only PDF files:
def walkPDFfiles(directory):
    for dirpath, dirs, files in os.walk(directory):
        for fname in files:
            name, ext = os.path.splitext(fname)
            if ext.lower() == '.pdf':
                yield os.path.join(dirpath, fname)
You can then use it in a loop like this:
for fname in walkPDFfiles('.'):
    print(fname, 'is PDF file.')

pepr
@begueradj: When searching for the files with a different extension, the code must be modified. The extension(s) could be passed via an argument if wanted. – pepr Jul 29 '14 at 07:48
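One way the extension argument mentioned in this comment might look, as a rough sketch: `walk_files_with_ext` and its `extensions` parameter are hypothetical names, not part of the answer's code. Extensions are assumed to be given with the leading dot and are compared case-insensitively.

```python
import os


def walk_files_with_ext(directory, extensions=('.pdf',)):
    """Yield paths of files under `directory` whose extension is in `extensions`."""
    # Normalize once so that '.PDF' and '.pdf' are treated the same.
    wanted = {ext.lower() for ext in extensions}
    for dirpath, dirs, files in os.walk(directory):
        for fname in files:
            name, ext = os.path.splitext(fname)
            if ext.lower() in wanted:
                yield os.path.join(dirpath, fname)
```

It could then be called as `walk_files_with_ext('.', ('.pdf', '.docx'))`, while the default still finds only PDF files.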