I am a real beginner at Python, but I am trying to compare data that has been extracted from two databases into files. The script uses a dictionary for each database's content; whenever I find a difference, I add it to the corresponding dictionary. The keys are the combination of the first two values (code and subCode), and each value is a list of the longCodes associated with that code/subCode combination. Overall my script works, but it wouldn't surprise me if it's just horribly constructed and inefficient. The sample data being processed looks like this:
0,0,83
0,1,157
1,1,158
1,2,159
1,3,210
2,0,211
2,1,212
2,2,213
2,2,214
2,2,215
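To make the intended dictionary shape concrete, grouping the sample rows above by their "code,subCode" prefix would give something like this (hand-built here just for illustration):

```python
# Each key is "code,subCode"; each value collects the longCodes
# from every sample row that starts with that prefix.
sample = {
    "0,0": ["83"],
    "0,1": ["157"],
    "1,1": ["158"],
    "1,2": ["159"],
    "1,3": ["210"],
    "2,0": ["211"],
    "2,1": ["212"],
    "2,2": ["213", "214", "215"],  # three rows share the 2,2 prefix
}
```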
The idea is that the data should be in sync, but sometimes it is not, and I am trying to detect the differences. In reality, when I extract data from the DBs, there are over a million lines in each file. Performance does not seem great (maybe it's as good as it can be?): it takes about 35 minutes to process and give me the results. If there are any suggestions for improving performance, I will gladly accept them!
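One likely cause of the slowness is difflib.ndiff, which computes a full ordered edit script and does far more work than is needed just to find rows present in one file but not the other. If line order does not matter for detecting the mismatches, a multiset difference may be much faster; here is a sketch under that assumption (file names are placeholders, and Counter is used instead of set in case the same full line can legitimately appear more than once in a file):

```python
from collections import Counter

def diff_files(master_path, slave_path):
    """Return (lines only in master, lines only in slave), ignoring order.

    Counter subtraction keeps only positive counts, so a line that
    appears twice in master but once in slave is reported once.
    """
    with open(master_path) as f1, open(slave_path) as f2:
        master = Counter(f1)  # counts each raw line, newline included
        slave = Counter(f2)
    only_master = master - slave
    only_slave = slave - master
    return only_master, only_slave
```

Each pass over a file is linear, so this should scale to million-line inputs far better than an ndiff over the full file contents.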
import difflib, collections

masterDb = collections.OrderedDict()
slaveDb = collections.OrderedDict()

with open('masterDbCodes.lst', 'r') as f1, open('slaveDbCodes.lst', 'r') as f2:
    diff = difflib.ndiff(f1.readlines(), f2.readlines())
    for line in diff:
        if line.startswith('-'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if codeSubCode not in masterDb:
                masterDb[codeSubCode] = [longCode]
            else:
                masterDb[codeSubCode].append(longCode)
        elif line.startswith('+'):
            line = line[2:]
            codeSubCode = ",".join(line.split(",", 2)[:2])
            longCode = ",".join(line.split(",", 2)[2:]).rstrip()
            if codeSubCode not in slaveDb:
                slaveDb[codeSubCode] = [longCode]
            else:
                slaveDb[codeSubCode].append(longCode)
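As an aside on the dictionary-building part itself: the two if/else branches do the same thing to different dictionaries, and collections.defaultdict(list) removes the "key already present?" check entirely. A sketch of that refactor (the helper name add_line is my own invention):

```python
import collections

def add_line(db, line):
    """Split 'code,subCode,longCode' and append longCode under 'code,subCode'."""
    code, sub_code, long_code = line.rstrip().split(",", 2)
    # defaultdict creates the empty list on first access, so no membership test
    db[f"{code},{sub_code}"].append(long_code)

master_db = collections.defaultdict(list)
add_line(master_db, "2,2,213\n")
add_line(master_db, "2,2,214\n")
```

In CPython 3.7+ plain dicts (and defaultdicts) preserve insertion order anyway, so switching away from OrderedDict loses nothing here.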