Compare two CSV files and search for similar items

Question

So I've got two CSV files that I'm trying to compare and get the results of the similar items. The first file, hosts.csv is shown below:

Path    Filename    Size    Signature
C:\     a.txt       14kb    012345
D:\     b.txt       99kb    678910
C:\     c.txt       44kb    111213

The second file, masterlist.csv is shown below:

Filename    Signature
b.txt       678910
x.txt       111213
b.txt       777777
c.txt       999999

As you can see the rows do not match up and the masterlist.csv is always larger than the hosts.csv file. The only portion that I'd like to search for is the Signature portion. I know this would look something like:

hosts[3] == masterlist[1]

I am looking for a solution that will give me something like the following (basically the hosts.csv file with a new RESULTS column):

Path    Filename    Size    Signature    RESULTS
C:\     a.txt       14kb    012345       NOT FOUND in masterlist
D:\     b.txt       99kb    678910       FOUND in masterlist (row 1)
C:\     c.txt       44kb    111213       FOUND in masterlist (row 2)

I've searched the posts and found something similar to this here but I don't quite understand it as I'm still learning python.

Edit Using Python 2.6

Martijn Pieters · Answer 1 · 2019-01-13T11:32:34.947

The answer by srgerg is terribly inefficient, as it operates in quadratic time; here is a linear time solution instead, using Python 2.6-compatible syntax:

import csv

with open('masterlist.csv', 'rb') as master:
    master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

with open('hosts.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:    
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []) + ['RESULTS'])

        for row in reader:
            index = master_indices.get(row[3])
            if index is not None:
                message = 'FOUND in master list (row {})'.format(index)
            else:
                message = 'NOT FOUND in master list'
            writer.writerow(row + [message])

This produces a dictionary, mapping signatures from masterlist.csv to a line number first. Lookups in a dictionary take constant time, making the second loop over hosts.csv rows independant from the number of rows in masterlist.csv. Not to mention code that's a lot simpler.

For those using Python 3, the above only needs to have the open() calls adjusted to open in text mode (remove the b from the file mode), and you want to add new line='' so the CSV reader can take control of line separators. You may want to state the encoding to use explicitly rather than rely on your system default (use encoding=...). The master_indices mapping can be built with a dictionary comprehension ({r[1]: i for i, r in enumerate(csv.reader(master))}).

The script together with the example inputs will give the error: "IndexError: list index out of range" — Chubaka, Jul 16 '14 at 00:22
@Chubaka: take into account that the inputs are *comma separated*, not tab separated. The OP only formatted them that way in the question. — Martijn Pieters, Jul 16 '14 at 07:36
Is it possible to implement a solution like this but instead of comparing specific indices compare the whole row contents (assuming you have csv files that are identical with only a few different rows). — ssbsts, May 17 '17 at 20:51
@ssbsts: put all rows from file 1 into a set: `existing = {tuple(r) for r in reader1}` (converting rows to tuples is needed to make them hashable), then test your other file against the `existing` set with `if tuple(row) in existing:`. — Martijn Pieters, May 17 '17 at 21:46

srgerg · Accepted Answer · 2014-04-15T22:37:14.240

17

Edit: While my solution works correctly, check out Martijn's answer below for a more efficient solution.

You can find the documentation for the python CSV module here.

What you're looking for is something like this:

import csv

f1 = file('hosts.csv', 'r')
f2 = file('masterlist.csv', 'r')
f3 = file('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

masterlist = list(c2)

for hosts_row in c1:
    row = 1
    found = False
    for master_row in masterlist:
        results_row = hosts_row
        if hosts_row[3] == master_row[1]:
            results_row.append('FOUND in master list (row ' + str(row) + ')')
            found = True
            break
        row = row + 1
    if not found:
        results_row.append('NOT FOUND in master list')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()

edited Apr 15 '14 at 22:37

answered Mar 11 '11 at 04:46

srgerg

18,719
4
57
39

1

This is pretty good. Using csv.DictReader might be clearer too, since you could replace `master_row[1]` with `master_row['signature']`. – chmullig Mar 11 '11 at 04:50
This produces a blank line after every result. – serk Mar 11 '11 at 05:17
The blank line issue is system dependent. If you get a blank line after every result, replace the `f3 = file('results.csv', 'w')` line with `f3 = file('results.csv', 'wb')` – srgerg Mar 11 '11 at 05:36
This works as needed. Easy to read through too! Thanks for the help! – serk Mar 11 '11 at 05:58
1

Why a list comprehension when `masterlist = list(c2)` would do? – Martijn Pieters Apr 15 '14 at 17:08
This is terribly inefficient; looping over all of `masterlist` each time. A mapping from column 1 to index would give you linear processing time instead of quadratic. – Martijn Pieters Apr 15 '14 at 17:15
@serk: the blank lines are a result of the files not being opened in binary mode (use `'wb'` instead of `'w'`). – Martijn Pieters Apr 15 '14 at 17:51
@MartijnPieters You're quite right. I'll fix the list comprehension in my answer and recommend readers check out your solution aswell. – srgerg Apr 15 '14 at 22:36
Using" if hosts_row['Signature'] == master_row['Signature']:" and csv.DictReader will give the following error: KeyError: 'Signature' – Chubaka Jul 16 '14 at 00:21
@srgerg What would go wrong if in the second loop `for master_row in c2:` instead of making a list from c2? – inckka Nov 16 '16 at 10:00
Thank you for your answer. I would add only one thing. For Python3 use open() instead of file() – Amir Md Amiruzzaman May 11 '20 at 02:32

score 4 · Answer 3 · answered Mar 11 '11 at 04:56

Python's CSV and collections module, specifically OrderedDict, are really helpful here. You want to use OrderedDict to preserve the order of the keys, etc. You don't have to, but it's useful!

import csv
from collections import OrderedDict


signature_row_map = OrderedDict()


with open('hosts.csv') as file_object:
    for line in csv.DictReader(file_object, delimiter='\t'):
        signature_row_map[line['Signature']] = {'line': line, 'found_at': None}


with open('masterlist.csv') as file_object:
    for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
        if line['Signature'] in signature_row_map:
            signature_row_map[line['Signature']]['found_at'] = i


with open('newhosts.csv', 'w') as file_object:
    fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
    writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
    writer.writer.writerow(fieldnames)
    for signature_info in signature_row_map.itervalues():
        result = '{0} FOUND in masterlist {1}'
        # explicit check for sentinel
        if signature_info['found_at'] is not None:
            result = result.format('', '(row %s)' % signature_info['found_at'])
        else:
            result = result.format('NOT', '')
        payload = signature_info['line']
        payload['RESULTS'] = result

        writer.writerow(payload)

Here's the output using your test CSV files:

Path    Filename        Size    Signature       RESULTS
C:\     a.txt   14kb    012345  NOT FOUND in masterlist 
D:\     b.txt   99kb    678910   FOUND in masterlist (row 1)
C:\     c.txt   44kb    111213   FOUND in masterlist (row 2)

Please excuse the misalignment, they are tab separated :)

I'm getting an ImportError: cannot import name OrderedDict. I'm using Python 2.6 and a portable version of python 3. Is OrderedDict specific only to 2.7? — serk, Mar 11 '11 at 05:17
Yes. You can change OrderedDict to dict() and it will work fine. — Mahmoud Abdelkader, Mar 11 '11 at 05:34
You can backport 2.7 OrderedDict to 2.6. The module can be found here: http://hg.python.org/cpython/file/291bc0097cc1/Lib/collections/__init__.py — Mahmoud Abdelkader, Mar 11 '11 at 05:38

score 0 · Answer 4 · answered Mar 11 '11 at 05:17

The csv module comes in handy in parsing csv files. But just for fun, I am simply splitting the input on whitespace to get at the data.

Just parse in the data, build a dict for the data in masterlist.csv with the signature as key and the line number as value. Now, for each row of hosts.csv, we can just query the dict and find out whether or not a corresponding entry exists in masterlist.csv and if so at which line.

#! /usr/bin/env python

def read_data(filename):
        input_source=open(filename,'r')
        input_source.readline()
        return [line.split() for line in input_source]

if __name__=='__main__':
        hosts=read_data('hosts.csv')
        masterlist=read_data('masterlist.csv')
        master=dict()
        for index,data in enumerate(masterlist):
                master[data[-1]]=index+1
        for row in hosts:
                try:
                        found="FOUND in masterlist (row %s)"%master[row[-1]]
                except KeyError:
                        found="NOT FOUND in masterlist"
                line=row+[found]
                print "%s    %s    %s    %s    %s"%tuple(line)

score -1 · Answer 5 · edited Oct 08 '18 at 20:52

I just fixed a small thing in Martijn Pieters code in order to make it work in Python 3, and in this code, I am trying to match the first column elements in the file1 row[0] with the first column elements in file2 row[0].

import csv
with open('file1.csv', 'rt', encoding='utf-8') as master:
    master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))

    with open('file2.csv', 'rt', encoding='utf-8') as hosts:
        with open('result.csv', 'w') as results:    
            reader = csv.reader(hosts)
            writer = csv.writer(results)

            writer.writerow(next(reader, []) + ['RESULTS'])

            for row in reader:
                index = master_indices.get(row[0])
                if index is not None:
                    message = 'FOUND in master list (row {})'.format(index)
                    writer.writerow(row + [message])

                else:
                     message = 'NOT FOUND in master list'
                     writer.writerow(row + [message])

        results.close()

Compare two CSV files and search for similar items

5 Answers5

Linked