Finding and moving PDF and DOC files to different directories using python

Question

I am trying to find and move resumes in PDF and DOC format to different directories: PDF files into the /PDF dir and DOC files into the /DOCX dir. My concerns are:

Are my regular expressions for finding the PDF and DOC files correct? The resumes have names such as john right ResumeQA.doc, abcResumeC.doc, ShawnResume.pdf, and johnright_ResumeQA.pdf.
I am not getting any counts or output, either in the IDE or in the output file.

Code that I came up with is below:

import os, sys, re

countpdf, countdoc = 0, 0

pdf = re.compile(r'\b\w*{resume}\w*\.[pdf]\b')
docx = re.compile(r'\b\w*{resume}\.[doc]\b]')

#os.mkdir(r'/Users/Desktop/Networking materials/PDF')
pdfdir = os.path.dirname(r'/Users/Desktop/Networking materials/PDF/')
print  pdfdir

#os.mkdir(r'/Users/Desktop/Networking materials/DOCX')
docxdir = os.path.dirname(r'/User/Desktop/Networking materials/DOCX/')
print docxdir

out = sys.stdout
with open('output.txt', 'w') as outfile:
     sys.stdout = outfile
     for rdir, directory, files in os.walk(r'/Users/Desktop/Networking materials/'):
         match1 = re.findall(pdf, str(files))
         print match1
         for items1 in match1:
             os.chdir(pdfdir)
             countpdf +=1
         print countpdf

         match2 = re.findall(docx, str(files))
         print match2
         for items2 in match2:
             os.chdir(docxdir)
             countdoc +=1
         print countdoc
         sys.stdout = out

The only output that I get so far is:

 /Users/Desktop/Networking materials/PDF
 /Users/Desktop/Networking materials/DOCX

Could anyone please correct my code and, if possible, suggest a more efficient way to accomplish this task?

score 1 · Accepted Answer · edited May 23 '17 at 11:46

No, your regexes are not correct. You can easily test them in a Python shell:

In [17]: a
Out[17]: 
[u'john right ResumeQA.doc',
 u' abcResumeC.doc',
 u' ShawnResume.pdf',
 u' johnright_ResumeQA.pdf']

In [20]: pdf = r'\b\w*{resume}\w*\.[pdf]\b'

In [21]: for j in a:
    print re.findall(pdf, j)
   ....:     
[]
[]
[]
[]

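As you can see, nothing matches. In that pattern, {resume} matches the literal text "{resume}", braces included, and [pdf] is a character class that matches a single character (p, d, or f) rather than the extension pdf. You should check your regexes with a regex tester before using them.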

The following regexes:

pdf_re = r".+resume\w*\.pdf"
doc_re = r".+resume\w*\.doc"

should work as long as you pass the re.I flag, which tells the regex engine to ignore case. The pdf regex matches any string that starts with one or more characters (.+), followed by the string 'resume' (case ignored), followed by zero or more word characters (\w* covers letters, digits, and underscores), followed by a literal dot (. is a special character, so it needs to be escaped), followed by the string pdf.

re.findall(pdf_re, j, re.I)
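As a quick check, the patterns above can be run against the file names given in the question. This is only a sketch: compiling the patterns up front with re.I is a choice made here, and the list names is simply the four examples from the question. The print calls are written with parentheses so the snippet runs on Python 2 and Python 3.

```python
import re

# Sample names taken from the question.
names = [
    "john right ResumeQA.doc",
    "abcResumeC.doc",
    "ShawnResume.pdf",
    "johnright_ResumeQA.pdf",
]

# Compile once with re.I so "Resume", "resume" and "RESUME" all match.
pdf_re = re.compile(r".+resume\w*\.pdf", re.I)
doc_re = re.compile(r".+resume\w*\.doc", re.I)

# Test each name on its own and keep the ones that match.
pdf_names = [n for n in names if pdf_re.match(n)]
doc_names = [n for n in names if doc_re.match(n)]

print(pdf_names)
print(doc_names)
```

Note that because the doc pattern is not anchored at the end, it would also accept names ending in .docx.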

Looking at the rest of your code:

The assignment sys.stdout = outfile is not needed. If you want to write to a file, just use outfile.write(content).
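A minimal sketch of writing the counts straight to the file, without touching sys.stdout. The report path (a temporary directory) and the placeholder counts are assumptions made so the example is self-contained; in the real script the counts would come from the loop.

```python
import os
import tempfile

# Hypothetical report location; a temp directory keeps the example self-contained.
report_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# Placeholder counts for illustration only.
countpdf, countdoc = 2, 2

# Write directly to the file object instead of redirecting sys.stdout.
with open(report_path, "w") as outfile:
    outfile.write("PDF resumes: %d\n" % countpdf)
    outfile.write("DOC resumes: %d\n" % countdoc)

# Read the report back to confirm what was written.
with open(report_path) as infile:
    report = infile.read()
print(report)
```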

The way you search for files, match1 = re.findall(pdf, str(files)), is not how you want to proceed. files is a list of file names, and str(files) turns the whole list into a single string, so the regex runs over all the names at once. You want to test each file name on its own to find the specific files to move.
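The difference can be seen in a small sketch. The two file names are from the question; the pattern is the case-insensitive one suggested in this answer. Running it over str(files) produces a single greedy match that spans both names, while testing each name separately finds both files.

```python
import re

pdf_re = re.compile(r".+resume\w*\.pdf", re.I)

files = ["ShawnResume.pdf", "johnright_ResumeQA.pdf"]

# str(files) is the text of the whole list, brackets and quotes included,
# so the greedy .+ swallows everything up to the last "resume".
joined_matches = pdf_re.findall(str(files))

# Testing each name separately yields one result per matching file.
per_file_matches = [f for f in files if pdf_re.match(f)]

print(len(joined_matches))
print(per_file_matches)
```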

Next, os.chdir only changes the current working directory; it does not change the location of any file, so it does not move files. To move a file, use os.rename or shutil.move (see the related question on SO).
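A small sketch of actually moving a file, here with shutil.move from the standard library. The temporary directory and the placeholder file are assumptions so the example can run anywhere; only the file name comes from the question.

```python
import os
import shutil
import tempfile

# Build a throwaway directory tree for the demonstration.
base = tempfile.mkdtemp()
pdfdir = os.path.join(base, "PDF")
os.mkdir(pdfdir)

# Create a placeholder file standing in for a real resume.
source = os.path.join(base, "ShawnResume.pdf")
with open(source, "w") as f:
    f.write("placeholder")

# shutil.move relocates the file; os.chdir would leave it where it is.
shutil.move(source, os.path.join(pdfdir, "ShawnResume.pdf"))

moved = os.listdir(pdfdir)
still_there = os.path.exists(source)
print(moved)
print(still_there)
```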

So you need to do something along the lines of:

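for rdir, directory, files in os.walk(r'/home/pawel/Documents'):
    for f in files:
        # re.I makes 'resume' match 'Resume', 'RESUME', etc.
        match = re.search(pdf_re, f, re.I)
        if match:
            matching_file = os.path.join(rdir, f)      # current full path
            target_location = os.path.join(pdfdir, f)  # same name under the PDF dir
            os.rename(matching_file, target_location)  # this is what moves the file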
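Putting the pieces together, one possible end-to-end version might look like the sketch below. The function name sort_resumes, the use of shutil.move instead of os.rename, and the pruning of the target directories during the walk are choices made for this sketch, not part of the original code. The demo at the bottom runs against a temporary directory filled with empty placeholder files named after the examples in the question.

```python
import os
import re
import shutil
import tempfile

PDF_RE = re.compile(r".+resume\w*\.pdf", re.I)
DOC_RE = re.compile(r".+resume\w*\.doc", re.I)


def sort_resumes(root, pdfdir, docdir):
    """Move matching resumes found under root into pdfdir / docdir.

    Returns a (pdf_count, doc_count) tuple. Hypothetical helper.
    """
    targets = {os.path.abspath(pdfdir), os.path.abspath(docdir)}
    for d in targets:
        if not os.path.isdir(d):
            os.makedirs(d)

    countpdf, countdoc = 0, 0
    for rdir, dirs, files in os.walk(root):
        # Prune the target directories so already-moved files are not revisited.
        dirs[:] = [d for d in dirs
                   if os.path.abspath(os.path.join(rdir, d)) not in targets]
        for f in files:
            src = os.path.join(rdir, f)
            if PDF_RE.match(f):
                shutil.move(src, os.path.join(pdfdir, f))
                countpdf += 1
            elif DOC_RE.match(f):
                shutil.move(src, os.path.join(docdir, f))
                countdoc += 1
    return countpdf, countdoc


# Demo on a temporary directory with empty placeholder files.
root = tempfile.mkdtemp()
for name in ["john right ResumeQA.doc", "abcResumeC.doc",
             "ShawnResume.pdf", "johnright_ResumeQA.pdf", "notes.txt"]:
    open(os.path.join(root, name), "w").close()

counts = sort_resumes(root,
                      os.path.join(root, "PDF"),
                      os.path.join(root, "DOCX"))
print(counts)
```

One caveat with this approach: if two resumes in different subdirectories share the same file name, the second move may overwrite the first, so a real script may want to check for an existing target before moving.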
