I have written code to parse a large set of emails (640,000 files); the output is a list of the filenames of emails sent on specific dates. The code is as follows:

import glob

def createListOfFilesByDate():

    searchDates = ["12 Mar 2012","13 Mar 2012"]
    outfile = "EmailList.txt"    
    sent = "Sent:"

    fileList=glob.glob("./Emails/*.txt")

    foundDate = False

    fout = open(outfile,'w')

    for filename in fileList:

        foundDate = False

        with open(filename) as f:                  
            header = [next(f) for x in xrange(10)]           
            f.close()

            for line in header:            
                if sent in line:
                    for searchDate in searchDates:                                                    
                        if searchDate in line:
                            foundDate = True
                            break

                if foundDate == True:                                                    
                    fout.write(filename + '\n')
                    break

    fout.close()

The problem is that the code processes the first 10,000 emails quite quickly but then slows down significantly and takes a long time to get through the remaining emails. I have investigated a number of possible reasons but have not found one. I wonder if I am doing something inefficiently.
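
For reference, here is a minimal sketch of the same search written a bit more defensively. It keeps the first-10-lines assumption and the dates from the question; `itertools.islice` is used so files shorter than 10 lines do not raise `StopIteration`, and a helper function (the name `headerMatchesDate` is mine, not from the question) avoids the double break. This is a cleanup sketch, not a fix for the slowdown itself:

import glob
from itertools import islice

def headerMatchesDate(filename, searchDates, marker="Sent:", maxLines=10):
    # True when one of the first maxLines lines contains the marker
    # together with one of the search dates.
    with open(filename) as f:
        for line in islice(f, maxLines):   # islice stops cleanly on short files
            if marker in line and any(d in line for d in searchDates):
                return True
    return False

def createListOfFilesByDate():
    searchDates = ["12 Mar 2012", "13 Mar 2012"]
    # `with` closes the output file even if an exception is raised mid-run
    with open("EmailList.txt", 'w') as fout:
        for filename in glob.glob("./Emails/*.txt"):
            if headerMatchesDate(filename, searchDates):
                fout.write(filename + '\n')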

  • There are bugs in your code. How they relate to the slow-down is not clear. Anyway, you should not call `f.close()` because you already have `with`; you should have `with` for `fout` as well; and `if foundDate == True` should not be inside the `for line in header` loop. BTW, reading the first 10 lines is wrong: headers can be much longer, they end with an empty line, and there can be wrapping (one header spread over multiple lines); see the header-reading sketch after these comments. Also, `foundDate = False` outside of the `for filename in fileList` loop does nothing. In the end, to see what is slow, use a [profiler](https://docs.python.org/2/library/profile.html) – zvone Mar 11 '17 at 15:14
  • Did you check whether you have a memory issue while running your script? It seems pretty efficient to me. Maybe try to keep your results in memory and write everything out when your script is done. – Thom Mar 11 '17 at 15:14
  • How much data is being buffered before it tries to flush it to disk? – David Bern Mar 11 '17 at 15:37
  • It's not certain that this is a bug in your code *at all*. If you've got a write-ahead cache anywhere (which could even be in your RAID controller, particularly common if they're battery-backed), then writes that fit into the cache will be much faster than ones made when it's full and has to wait for the buffer to flush. Thus, disk I/O can often be bursty regardless of application behavior. – Charles Duffy Mar 11 '17 at 15:55
  • Profile your code and see where it's spending most of its time so you (and everyone else) can stop guessing; a minimal invocation is sketched after these comments. It's easy to do, see [**_How can you profile a script?_**](http://stackoverflow.com/questions/582336/how-can-you-profile-a-script) – martineau Mar 11 '17 at 16:00
  • @zvone Thanks. Removing `f.close()` has no effect, so I assume the file is closed automatically. I read the first 10 lines to save time; I did try reading all of each email file, but it did not change the results and only made things slower. The `foundDate` check is there to break out of the `for line in header` loop to save time. Could I do a double break at the line after `foundDate = True`? I tried a profiler and all of the time is spent doing `{open}`. – Dom Mar 11 '17 at 16:30
  • If all of the time is spent in `open`, there is probably nothing you can do; it is a limitation of the file system. The only mystery is why it is not slow from the start. Maybe the first files are cached from your previous attempts, so they are faster; maybe the disk is fragmented; maybe there is a significant systematic difference in the size of the first 10,000 files compared to the rest... – zvone Mar 11 '17 at 16:47
  • I think it is cached. If I break it and rerun it, it runs quickly up to where it stopped previously and then slows down. Thanks for your help. – Dom Mar 11 '17 at 17:10
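
Following up on zvone's point about header structure, here is a sketch of reading the whole header block instead of a fixed 10 lines. It assumes RFC 822-style headers (the block ends at the first blank line, and wrapped headers continue on lines that start with whitespace); whether the exported .txt files in the question actually follow that format is an assumption, and the helper name is illustrative, not from the question:

def readHeaderLines(f):
    # Collect logical header lines up to the first blank line, joining
    # folded continuations (lines starting with whitespace) onto the
    # previous header so a wrapped "Sent:" line is seen as one string.
    headers = []
    for line in f:
        if not line.strip():              # blank line ends the header block
            break
        if line[0] in " \t" and headers:  # continuation of the previous header
            headers[-1] += " " + line.strip()
        else:
            headers.append(line.rstrip("\n"))
    return headers

The matching test in the sketch under the question would then iterate over `readHeaderLines(f)` instead of `islice(f, maxLines)`.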
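
And per martineau's profiling suggestion, the standard-library `cProfile` interface needs only a couple of lines; sorting by cumulative time is how an `{open}` hotspot like the one reported above would surface at the top of the listing:

import cProfile

# Run the search under the profiler and print stats sorted by cumulative time.
cProfile.run("createListOfFilesByDate()", sort="cumulative")

The same thing from the shell: `python -m cProfile -s cumulative yourscript.py` (where `yourscript.py` stands in for the actual script name).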

0 Answers