2

I'm fairly new at programming and I am trying to write a python program that will compare 2 .csv files by specific columns and check for additions, removals, and modifications. The .csv files are both in the following format, contain the same amount of columns, and use BillingNumber as the key:

BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State
"2","CHARLIE RYAN","Yes","No","Yes","Reading","PA"
"3","INSURANCE BILLS","","","","",""
"4","AAA","","","","",""

I need to compare only columns 0, 1, 2, and 4. I have tried many different ways to accomplish this but I haven't had any luck. I understand that I can load them into dictionaries using csv.DictReader or csv.reader, but after that I get stuck. I'm not sure exactly where or how to start after loading them into memory.

I tried this previously:

import time
old_lines = set((line.strip() for line in open(r'Old/file1.csv', 'r+')))
file_new = open(r'New/file2.csv', 'r+')

choice = 0
choice = int( input('\nPlease choose your result format.\nEnter 1 for .txt, 2 for .csv or 3 for .json\n') )
time.sleep(1)
print(".")
time.sleep(1)
print("..")
time.sleep(1)
print("...")
time.sleep(1)
print("....")
time.sleep(1)
print('Done! Check "Different" folder for results.\n')
if choice == 1:
    file_diff = open(r'Different/diff.txt', 'w')
elif choice == 2:
    file_diff = open(r'Different/diff.csv', 'w')
elif choice == 3:
    file_diff = open(r'Different/diff.json', "w")
else: 
    print ("You MUST enter 1, 2 or 3")
    exit()

for line in file_new:
    if line.strip() not in old_lines:
        file_diff.write("** ERROR! Entry "+ line + "** Does not match previous file\n\n")
file_new.close()
file_diff.close()

It doesn't work properly because if there is an additional line, or one is missing, it logs everything after that line as different. Also it compares the whole line which is not what I want to do. This was basically just a starting point and although it kind of worked, it isn't specific enough for what I need. I'm really just looking for a good place to start. Thanks!

Wilduck
  • 13,822
  • 10
  • 58
  • 90
user2120375
  • 31
  • 1
  • 3
  • This would be somewhere to start http://docs.python.org/2/library/csv.html – danodonovan Feb 28 '13 at 16:39
  • 2
    So it sounds like you're trying to write a diff program. That's a pretty in depth topic. Try http://stackoverflow.com/questions/805626/diff-algorithm or http://stackoverflow.com/questions/5897983/diff-algorithm-implementation-in-python. – Hoopdady Feb 28 '13 at 16:39
  • 1
    "I'm not sure exactly where or how to start after loading them into memory". There's a reason for this. You haven't rigorously described what you want. You said you want to "compare 2 .csv files by specific columns and check for additions, removals, and modifications". As Hoopdady mentioned, there is a lot covered by that simple statement. I would start by writing down, very carefully, exactly what you want to check for (like: what if there is an addition in one column and a deletion in another). Once you pose your question well enough, the implementation will become easier. – Wilduck Feb 28 '13 at 16:41
  • For instance let's say that there is a new key or a deleted key found, then I would like to output that record. If there is a modification of any data in columns 0, 1, 2, or 4 I would like to output that record. Basically I want to see if ANYTHING has changed in those columns then output any modified records to a third file. I have tried using csvDictReader, but then when I try to compare I get various errors including '_csv.reader' object has no attribute " " I'm not looking for an exact answer. – user2120375 Feb 28 '13 at 16:50

4 Answers4

1

I think you were on the right track using the csv module. Since 'BillingNumber' is a unique key, I would create one dict for the "old" billing file, and another for the "new" billing file:

import csv

def make_billing_dict(csv_dict_reader):
    bdict = {}
    for entry in csv_dict_reader:
        key = entry['BillingNumber']
        bdict[key] = entry
    return bdict

with open('old.csv') as csv_file:
    old = csv.DictReader(csv_file)
    old_bills = make_billing_dict(old)

That results in this data structure for old_bills:

{'2': {'BillingNumber': '2',
       'City': 'Reading',
       'CustomerName': 'CHARLIE RYAN',
       'IsActive': 'Yes',
       'IsCreditHold': 'No',
       'IsPayScan': 'Yes',
       'State': 'PA'},
 '3': {'BillingNumber': '3',
       'City': '',
       'CustomerName': 'INSURANCE BILLS',
       'IsActive': '',
       'IsCreditHold': '',
       'IsPayScan': '',
       'State': ''},
 '4': {'BillingNumber': '4',
       'City': '',
       'CustomerName': 'AAA',
       'IsActive': '',
       'IsCreditHold': '',
       'IsPayScan': '',
       'State': ''}}

Once you create the same data structure for the "new" billing file, you can easily find the differences:

# Keys that are in old_bills, but not new_bills
print set(old_bills.keys()) - set(new_bills.keys())

# Keys that are in new_bills, but not old_bills
print set(new_bills.keys()) - set(old_bills.keys())

# Compare columns for same billing records
# Will print True or False
print old_bills['2']['CustomerName'] == new_bills['2']['CustomerName']
print old_bills['2']['IsActive'] == new_bills['2']['IsActive']

Obviously, you wouldn't write a separate print statement for each comparison. I'm just demonstrating how to use the data structures to find differences. Next, you should write a function to loop through all possible BillingNumbers and check for differences between old and new...but I'll leave that part for you.

Tom Offermann
  • 1,391
  • 12
  • 12
0

Do you have to write this yourself? If this is a programming exercise, all power to you. Otherwise, look for a tool called "diff" which probably exists in some form you already have access to. It's built into many other tools, such as text editors and like vim, emacs and notepad++, and version control systems like subversion mercurial and git.

I recommend you use the established workhorse instead of reinventing the wheel. git diff is a beast.

David Marx
  • 8,172
  • 3
  • 45
  • 66
  • It's just something i'm trying to figure out. They put up a job listing for a new tech guy at work and the person they will hire has to solve this problem. I just wanted to see if I could do it out of curiosity but after trying for about a week I haven't been able to do it. The guy that runs the tech department said it should be pretty easy but it seems more in depth than he originally stated. I just get bored at work sometimes and it was something for me to try but now i'd like to see it work. If it was up to me i'd just load them into Excel and compare. – user2120375 Feb 28 '13 at 17:02
  • They definitely don't want you to build a tool from scratch. – David Marx Feb 28 '13 at 18:10
0

Reading your comment:

It's just something i'm trying to figure out. They put up a job listing for a new tech guy at work and the person they will hire has to solve this problem.

They very well may be looking for some command-line-fu. Something akin to

diff <(awk -F "\"*,\"*" '{print $1,$2,$3,$5}' csv1.csv) <(awk -F "\"*,\"*" '{print $1,$2,$3,$5}' csv2.csv)

A command that will work in bash, using the diff tool, to compare certain columns, selected using awk.

This is clearly not a python based solution. However, this solution does demonstrate the power of simple unix-based tools.

Community
  • 1
  • 1
Wilduck
  • 13,822
  • 10
  • 58
  • 90
  • That is very cool indeed. When I run it it works well, but it puts some strange numbers on the first line before printing the differences: 4965a4966,4967. Also, would this be a fairly tough program to write in python or am I just way behind on my skills? – user2120375 Feb 28 '13 at 17:22
  • It would be fairly easy to replace the part that `awk` is doing with python. However, writing your own `diff` tool could be pretty complicated. Fortunatley, python is "batteries-included" and already contains a differencing library: http://docs.python.org/2/library/difflib.html. The implementation is in python, so you could see if you'd want to re-implement it http://svn.python.org/projects/python/trunk/Lib/difflib.py – Wilduck Feb 28 '13 at 17:26
  • 1
    It sounded a lot easier than it is. I understand what I want it to do but I haven't quite figured out how to implement it with python. I shall continue my efforts. Thanks for all the replies. – user2120375 Feb 28 '13 at 17:31
  • Well, if the replies are helpful, stackoverflow has a method of showing your appreciation (upvotes). I'm sure anyone who was helpful would appreciate an upvote ;). Also, while it may or may not apply in this case, it's good etiquette to "accept" an answer that solves your problem by clicking the check-mark for that answer. I'm glad this was an insightful discussion for you =). – Wilduck Feb 28 '13 at 17:33
  • I apologize and thank you all, I do not have rep to upvote or I surely would. – user2120375 Feb 28 '13 at 17:55
0

Because the requirements of these things have a tendency to spiral I think it would be worth putting the data in a SQLite database.

As the logic for detecting whether a row is deleted or just new can be tricky to implement.

In the below I've presumed that BillingNumber is the id and not to be change.

import sqlite3
con = sqlite3.connect(":memory:")

cursor = con.cursor()
columns = "BillingNumber,CustomerName,IsActive,IsCreditHold,IsPayScan,City,State"
cursor.execute("CREATE TABLE left  (%s);" % columns)
cursor.execute("CREATE TABLE right (%s);" % columns) 

placeholders = ",".join("?" * len(columns.split(',')))

import csv
def reader(filename):
    for (lineno, line) in enumerate(open(filename)):
        if lineno > 0: # skip header
            yield line

def load_table(tablebname, filename):
    for row in csv.reader(reader(filename)):
        cursor.execute("INSERT INTO %s VALUES(%s);" % (tablebname, placeholders), row)

load_table("left",  "left.csv")
load_table("right", "right.csv")

if False:
    print "LEFT"
    for row in cursor.execute("SELECT * from left;"):
        print row[0]

        print "RIGHT"
        for row in cursor.execute("SELECT * from right;"):
            print row

def dataset(tablename, columns):
    for row in cursor.execute("SELECT * from %s;" % tablename):
        yield tuple(row[x] for x in columns)

# To use if raw data required.       
#left_dataset = dataset("left", [0,1,2,4])
#right_dataset = dataset("right", [0,1,2,4])

# COMPARE functions.
def different_rows():
    q = """SELECT left.*, right.* 
    FROM left, right
    WHERE left.BillingNumber = right.BillingNumber
    AND ( left.CustomerName !=  right.CustomerName OR 
          left.IsActive     !=  right.IsActive OR
          left.IsPayScan    !=  right.IsPayScan )
          ;
    """
    for row in cursor.execute(q):
        print "DIFFERENCE", row

def new_rows():
    q = """SELECT right.* 
    FROM right
    WHERE right.BillingNumber NOT IN ( SELECT BillingNumber FROM left)
          ;
    """
    for row in cursor.execute(q):
        print "NEW", row

different_rows()
new_rows()

The OP would have to write different functions to compare the data but I on the whole using SQL might be easier.

sotapme
  • 4,695
  • 2
  • 19
  • 20