I have written code to parse a large set of emails (640,000 files); the output is a list of the filenames of emails sent on specific dates. The code is as follows:
import glob
from itertools import islice

def createListOfFilesByDate():
    searchDates = ["12 Mar 2012", "13 Mar 2012"]
    outfile = "EmailList.txt"
    sent = "Sent:"
    fileList = glob.glob("./Emails/*.txt")
    fout = open(outfile, 'w')
    for filename in fileList:
        foundDate = False
        with open(filename) as f:
            # islice stops early on files shorter than 10 lines, avoiding a
            # StopIteration; the with block closes f, so no f.close() is needed
            header = list(islice(f, 10))
        for line in header:
            if sent in line:
                for searchDate in searchDates:
                    if searchDate in line:
                        foundDate = True
                        break
            if foundDate:
                fout.write(filename + '\n')
                break
    fout.close()
The problem is that the code processes the first 10,000 or so emails quite quickly but then slows down significantly and takes a long time to work through the remaining files. I have investigated a number of possible causes but have not found one. Am I doing something inefficient here?
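To narrow down where the time goes, one option is to isolate the header check in a helper and print per-batch throughput; if the rate drops steadily even though each file is handled identically, that points at the filesystem or disk caching rather than the parsing logic. This is only a minimal diagnostic sketch: the helper name headerMatchesDate and the 10,000-file batch size are my own choices, not part of the original code.

```python
import glob
import time
from itertools import islice

def headerMatchesDate(filename, searchDates, headerLines=10):
    # Read at most the first headerLines lines of the file;
    # islice stops early if the file is shorter than that.
    with open(filename) as f:
        for line in islice(f, headerLines):
            if "Sent:" in line and any(d in line for d in searchDates):
                return True
    return False

def createListOfFilesByDate():
    searchDates = ["12 Mar 2012", "13 Mar 2012"]
    start = time.time()
    with open("EmailList.txt", "w") as fout:
        for count, filename in enumerate(glob.glob("./Emails/*.txt"), 1):
            if headerMatchesDate(filename, searchDates):
                fout.write(filename + "\n")
            if count % 10000 == 0:
                # Throughput per 10,000-file batch; a steady decline here
                # suggests the slowdown is outside the per-file logic.
                print("%d files, last batch took %.1f s" % (count, time.time() - start))
                start = time.time()
```

The helper does the same "Sent:" line and date matching as the original loop, so swapping it in should not change which filenames are written.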