0

I have two csv files formatted the same way (two columns of data):

    Name        Link
    Name        Link
    Name        Link
    Name        Link

The only difference between them is the data within those two columns (different names and different links). I'd like to find the names and links that appear in both csv files and write them to a new csv file. So far I've tried:

    import csv

    f1 = file('/path/to/f1.csv', 'r')
    f2 = file('/path/to/f2.csv', 'r')
    f3 = file('/path/to/f3.csv', 'w')

    c1 = csv.reader(f1)
    c2 = csv.reader(f2)
    c3 = csv.writer(f3)

    masterlist = [row for row in c2]

    for hosts_row in c1:
        row = 1
        found = False
        for master_row in masterlist:
            results_row = hosts_row
            if hosts_row[3] == master_row[1]:
                results_row.append('FOUND in master list (row ' + str(row) + ')')
                found = True
                break
            row = row + 1
        if not found:
            results_row.append('NOT FOUND in master list')
        c3.writerow(results_row)

    f1.close()
    f2.close()
    f3.close()

This is based on an answer to a similar question, however I realize the format of the csv files in that case is different. And so I get this error:

         masterlist = [row for row in c2]
    _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

How do I adjust the above code to fit the format for my csv files. Or is there a better way to do this? Any help would greatly be appreciated as I'm just starting with python and I don't think I've completely grasped the concept of comparing data in two files yet.

Community
  • 1
  • 1
JFC
  • 29
  • 1
  • 5
  • Are you sure about your code ? You have three times f1= and fs_checkins* are not defined. – Dvx Apr 05 '13 at 19:59
  • @MrDave Whoops, forgot to make those changes when transferring my original code into this post. – JFC Apr 05 '13 at 20:19

3 Answers3

1
l1 = set(open('f1.csv'))
l2 = set(open('f2.csv'))
open('f3.csv', 'wb').writelines(l1 & l2)

l1 and l2 are sets of the lines in f1.csv and f2.csv respectively. l1 & l2 evaluates to the set intersection which are the lines found in both files and outputs them to f3.csv.

Björn Lindqvist
  • 19,221
  • 20
  • 87
  • 122
  • Found a simple solution using sets and `&` to find the intersections, as suggested by @Björn Lindqvist. The only differences in my code was I called the `.readlines()` method on the file opener and joined `/n` for each item so that the data wasn't all entered into only one column in the csv: `f3 = file('/path/to/f3.csv', 'w') f1 = set(open('/path/to/f1.csv').readlines()) f2 = set(open('/path/to/f2.csv').readlines()) similarities = f1 & f2 f3.write('\n'.join(similarities)) f3.close()` – JFC Apr 05 '13 at 21:19
0

How big is your files? can you load both of them in memory. The code above loads one of them. And because you are interested in whole row (I guess), You don't have to comapre contents within row.

You dont need csv reader too.

so, try

f1 = open('/path/to/f1.csv', 'r').readlines()
f2 = open('/path/to/f2.csv', 'r').readlines()
f3 = open('/path/to/f3', 'a')




for lines in f1:
    if lines in f2:
          f3.write(lines)


f1.close()
f2.close()
f3.close()
Ananta
  • 3,671
  • 3
  • 22
  • 26
  • you can only read f2 as a whole and iterate over lines in f1. It would save some memory. – Dvx Apr 05 '13 at 19:57
0

Depending on the size of the file, it could make sense to use a dictionary to find out if host_rows[3] was already found in master_row. Your algorithm complexity would drop from a N1xN2 to N1 since dictionnary acces is O(1)

It would be something like this :

master_dct = {}
for master_row in master_list :
    master_dct[master_row[3]] = None
results_row = []
for hosts_row in c1 :
    if hosts_row[1] in master_dct :
        results_row.append(hosts_row)
Dvx
  • 289
  • 2
  • 10