Python script that performs line matching over stale files generates inconsistent output

Question

I created a Python script to parse mail (Exim) log files and apply pattern matching in order to get a top-100 list of the domains most often sent to from my SMTP servers. However, every time I execute the script I get different counts. These are stale log files, and I cannot find a functional flaw in my code.

Example output:
    1:
    70353 gmail.com
    68337 hotmail.com
    53657 yahoo.com
    2:
    70020 gmail.com
    67741 hotmail.com
    54397 yahoo.com
    3:
    70191 gmail.com
    67917 hotmail.com
    54438 yahoo.com

Code:

#!/usr/bin/env python

import os
import datetime
import re
from collections import defaultdict

class DomainCounter(object):
    def __init__(self):
        self.base_path = '/opt/mail_log'
        self.tmp = []
        self.date = datetime.date.today() - datetime.timedelta(days=14)
        self.file_out = '/var/tmp/parsed_exim_files-' + str(self.date.strftime('%Y%m%d')) + '.decompressed'

    def parse_log_files(self):
        sub_dir = os.listdir(self.base_path)
        for directory in sub_dir:
            if re.search('smtp\d+', directory):
                fileInput = self.base_path + '/' + directory + '/maillog-' + str(self.date.strftime('%Y%m%d')) + '.bz2'
                if not os.path.isfile(self.file_out):
                    os.popen('touch ' + self.file_out)
                proccessFiles = os.popen('/bin/bunzip2 -cd ' + fileInput + ' > ' + self.file_out)
                accessFileHandle = open(self.file_out, 'r')
                readFileHandle = accessFileHandle.readlines()
                print "Proccessing %s." % fileInput
                for line in readFileHandle:
                    if '<=' in line and ' for ' in line and '<>' not in line:
                        distinctLine = line.split(' for ')
                        recipientAddresses = distinctLine[1].strip()
                        recipientAddressList = recipientAddresses.strip().split(' ')
                        if len(recipientAddressList) > 1:
                            for emailaddress in recipientAddressList:
                                # Since syslog messages are transmitted over UDP some messages are dropped and needs to be filtered out.
                                if '@' in emailaddress:
                                    (login, domein) = emailaddress.split("@")
                                    self.tmp.append(domein)
                                    continue
                        else:
                            try:
                                (login, domein) = recipientAddressList[0].split("@")
                                self.tmp.append(domein)
                            except Exception as e:
                                print e, '<<No valid email address found, skipping line>>'

                accessFileHandle.close()
                os.unlink(self.file_out)
        return self.tmp


if __name__ == '__main__':
    domainCounter = DomainCounter()
    result = domainCounter.parse_log_files()
    domainCounts = defaultdict(int)
    top = 100
    for domain in result:
        domainCounts[domain] += 1

    sortedDict = dict(sorted(domainCounts.items(), key=lambda x: x[1], reverse=True)[:int(top)])
    for w in sorted(sortedDict, key=sortedDict.get, reverse=True):
        print '%-3s %s' % (sortedDict[w], w)

try just running it on one file multiple times and make sure you get the same output ... also run it against a simplified file to ensure the output is what you expect — Joran Beasley, Mar 10 '14 at 15:33
There are some obvious quirks in your code. For instance, you call `sorted()` twice in the end. Why is that? Another unlogical thing, you have a `for .... else` construct, but don't use `break`. Sort these things out, before we proceed hunting that bug. — Dr. Jan-Philip Gehrcke, Mar 10 '14 at 15:35
Looks like you're doing a whole lot of acrobatics to parse the logfiles. Any chance you can paste a snapshot of what one of these logfiles looks like? — Joel Cornett, Mar 10 '14 at 15:36
Without digging too deep into your code, I guess your different results come from the fact that a dictionary does not store or guarantee order of items. That line `sortedDict = dict(sorted(domainCounts.items(), ...))` does not make sense, since you lose order in the moment you create a dictionary from an ordered sequence of 2-tuples. — Dr. Jan-Philip Gehrcke, Mar 10 '14 at 15:39
I have reformatted the code and fixed my for .... else construct. It does seem that the problem lies within my sorting mechanism. This solution is (to me) very complex however. Any ideas how to fix the dictionary sorting? — user.py, Mar 11 '14 at 09:55

score 1 · Accepted Answer · edited May 23 '17 at 12:29

proccessFiles = os.popen('/bin/bunzip2 -cd ' + fileInput + ' > ' + self.file_out)

This call is non-blocking: os.popen starts the command and returns immediately, so the lines that follow may already be reading the file while bunzip2 is still writing it. This is essentially a race condition, which is why each run sees a different amount of data. Wait for the command to complete before reading the file.

Also see the question "Python popen command. Wait until the command is finished", since os.popen has been deprecated since Python 2.6 (depending on which version you are using).
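
A minimal sketch of waiting for the external command, assuming Python 2.7 or 3 and the standard `subprocess` module. `run_to_file` is a hypothetical helper name, not part of the original script. `subprocess.check_call` returns only after the child process has exited, so the output file is complete by the time it is opened for reading:

```python
import subprocess


def run_to_file(argv, out_path):
    """Run argv, write its stdout to out_path, and return once it has exited.

    run_to_file is a hypothetical helper used for illustration only.
    """
    with open(out_path, 'wb') as out:
        # check_call blocks until the child process exits and raises
        # CalledProcessError if the exit status is non-zero.
        subprocess.check_call(argv, stdout=out)
    return out_path
```

In the script from the question, this would stand in for the `os.popen` call, for example `run_to_file(['/bin/bunzip2', '-cd', fileInput], self.file_out)`. Passing the arguments as a list also avoids building a shell command by string concatenation.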

Side note: the same applies to the line below. Because the call does not wait for the command, the file may or may not exist once it returns:

os.popen('touch ' + self.file_out)
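
The shell redirection `>` in the `bunzip2` command creates the output file anyway, so the `touch` call is not needed. As an alternative to shelling out at all (a different technique from the one used in the question), the standard-library `bz2` module can read the compressed logs directly, which removes both the temporary file and the race. A rough sketch, assuming Python 3.3+ for `bz2.open` (on Python 2, `bz2.BZ2File` is the closest equivalent); `count_recipient_domains` is a hypothetical helper that mirrors the question's line filter:

```python
import bz2
from collections import Counter


def count_recipient_domains(bz2_path, counts=None):
    """Add recipient-domain counts from one compressed log to counts.

    Hypothetical helper for illustration; the line filter is copied from
    the script in the question.
    """
    counts = Counter() if counts is None else counts
    # Text mode decompresses on the fly; undecodable bytes are replaced
    # instead of raising, since log lines may be damaged.
    with bz2.open(bz2_path, 'rt', errors='replace') as handle:
        for line in handle:
            if '<=' in line and ' for ' in line and '<>' not in line:
                # Everything between the first and second ' for ' is
                # treated as the recipient list, as in the original code.
                for address in line.split(' for ')[1].split():
                    if '@' in address:
                        counts[address.rsplit('@', 1)[1]] += 1
    return counts
```

Calling the function once per log file with the same `Counter` accumulates the totals, and `counts.most_common(100)` then returns the top 100 already sorted by count, so no second `sorted()` call or intermediate `dict` is required.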
