
I am really a beginner at Python, but I am trying to compare data that has been extracted from two databases into files. In the script I use a dictionary for each database's content, and whenever I find a difference I add it to the corresponding dictionary. The keys are the combination of the first two values (code and subCode), and the values are lists of the longCodes associated with that code/subCode combination. Overall my script works, but it wouldn't surprise me if it is horribly constructed and inefficient. The sample data being processed looks like this:

0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215

The idea is that the data should be in sync, but sometimes it is not, and I am trying to detect the differences. In reality, when I extract the data from the databases there are over 1 million lines in each file. Performance does not seem great (or maybe it is as good as it can be?); it takes about 35 minutes to process everything and give me the results. If there are any suggestions for improving performance, I will gladly accept them!

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(),f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)
        elif line.startswith('+'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

f1.close()
f2.close()
ssbsts
  • I don't know if it would be faster, but the `ordereddefaultdict` class defined at the start of [this answer](https://stackoverflow.com/a/4127426/355230) of mine to another question would allow you to get rid of the four lines beginning with `if not codeSubCode in xxxDb:` in each of the two cases and replace them with an unconditional `xxxDb[codeSubCode].append(longCode)`. Note also you don't need to close the two files, `with` will do it automatically. – martineau May 22 '17 at 18:55
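
A minimal sketch of the comment's suggestion, assuming an OrderedDefaultdict roughly like the one in the linked answer (the exact class there may differ); on Python 3.7+ a plain collections.defaultdict(list) would also preserve insertion order:

import collections

# Hypothetical OrderedDefaultdict: an OrderedDict that builds missing values
# with a factory, the way collections.defaultdict does.
class OrderedDefaultdict(collections.OrderedDict):
    def __init__(self, default_factory=None, *args, **kwargs):
        self.default_factory = default_factory
        super(OrderedDefaultdict, self).__init__(*args, **kwargs)

    def __missing__(self, key):
        if self.default_factory is None:
            raise KeyError(key)
        self[key] = value = self.default_factory()
        return value

masterDb = OrderedDefaultdict(list)
slaveDb = OrderedDefaultdict(list)

# Inside the diff loop, each four-line if/else then collapses to a single call:
# masterDb[codeSubCode].append(longCode)
# slaveDb[codeSubCode].append(longCode)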

2 Answers


Try this:

import difflib, sys, csv, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
with open('masterDbCodes.lst','r') as f1, open('slaveDbCodes.lst','r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            sp = line.split(",", 2)          # split only once per line
            codeSubCode = ",".join(sp[:2])
            longCode = ",".join(sp[2:]).rstrip()
            try:
                masterDb[codeSubCode].append(longCode)
            except KeyError:
                masterDb[codeSubCode] = [longCode]
        elif line.startswith('+'):
            line = line[2:]
            sp = line.split(",", 2)
            codeSubCode = ",".join(sp[:2])
            longCode = ",".join(sp[2:]).rstrip()
            try:
                slaveDb[codeSubCode].append(longCode)
            except KeyError:
                slaveDb[codeSubCode] = [longCode]
Martin Evans
    You might want to pinpoint the changes, possibly explaining in words what you changed and why. – Ami Tavory May 22 '17 at 18:46
  • I did try your code changes; the processing time came down to 33 minutes, so there was some improvement. Thanks for your input. – ssbsts May 22 '17 at 20:12

So I ended up using different logic to come up with a much more efficient script. Big thanks to https://stackoverflow.com/users/100297/martijn-pieters for the assist.

#!/usr/bin/python

import csv, sys, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()
outFile = open('results.csv', 'wb')

# First find entries in SLAVE that don't match MASTER
with open('masterDbCodes.lst', 'rb') as master:
    reader1 = csv.reader(master)
    master_rows = {tuple(r) for r in reader1}

with open('slaveDbCodes.lst', 'rb') as slave:
    reader = csv.reader(slave)

    for row in reader:
        if tuple(row) not in master_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in slaveDb:
                slaveDb[codeSubCode] = [(longCode)]
            else:
                slaveDb[codeSubCode].append(longCode)

# Now find entries in MASTER that don't match SLAVE
with open('slaveDbCodes.lst', 'rb') as slave:
    reader1 = csv.reader(slave)
    slave_rows = {tuple(r) for r in reader1}

with open('masterDbCodes.lst', 'rb') as master:
    reader = csv.reader(master)

    for row in reader:
        if tuple(row) not in slave_rows:
            code = row[0]
            subCode = row[1]
            codeSubCode = ",".join([code, subCode])
            longCode = row[2]
            if not codeSubCode in masterDb:
                masterDb[codeSubCode] = [(longCode)]
            else:
                masterDb[codeSubCode].append(longCode)

This solution can process the data (reading each file twice, in fact) in about 10 seconds.
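
For reference, the two passes above could also be folded into a single grouping step by diffing the row sets directly. This is only a minimal sketch under a few assumptions: the same filenames and Python 2 style 'rb' file handling as above, exactly three fields per row, and sorted() used only to give the grouped output a deterministic order (the original file order is not preserved):

import csv, collections

def load_rows(path):
    # Read a code,subCode,longCode file into a set of row tuples.
    with open(path, 'rb') as f:
        return {tuple(row) for row in csv.reader(f)}

master_rows = load_rows('masterDbCodes.lst')
slave_rows = load_rows('slaveDbCodes.lst')

masterDb = collections.OrderedDict()   # rows present only in the master file
slaveDb = collections.OrderedDict()    # rows present only in the slave file

# Group each one-sided row under its "code,subCode" key.
for rows, db in ((master_rows - slave_rows, masterDb),
                 (slave_rows - master_rows, slaveDb)):
    for code, subCode, longCode in sorted(rows):
        db.setdefault(",".join([code, subCode]), []).append(longCode)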

ssbsts