
[Using Python 3] I have a CSV file that I want to read and remove a 'special' case of duplicates from. The script should write the deduplicated data out to a new CSV, whilst respecting the header row.

It's best to explain by example. The CSV file looks something like this:

ID  Name    HeaderX HeaderY HeaderZ ...
1   A       string  float   string  ...
1   A       string  float   string  ...
1   A       string  float   string  ...
2   A       string  float   string  ...
2   B       string  float   string  ...
3   A       string  float   string  ...
4   B       string  float   string  ...
5   C       string  float   string  ...
6   D       string  float   string  ...
... ...     ...     ...     ...     ...

Here there are duplicate rows for ID=1 and ID=2, but I only want to keep duplicates where the Names are all the same. So in this example I want to keep all instances of ID=1 but remove all instances of ID=2. In other words: remove all rows of any ID whose Name column has more than one variant. (Does that make sense?!)

Currently I have the following code (below), based on this thread. However, it does exactly the opposite: it removes duplicates based on the two columns, leaving both instances of ID=2 and removing rows where ID=1.

Also, ideally I would like the script to print the count of duplicates it removed.

import csv

filename = 'testing.csv'
outfile = 'outfile.csv'

with open(filename, 'r') as fin, open(outfile, 'w') as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, lineterminator='\n', fieldnames=reader.fieldnames)
    writer.writeheader()

    # Write only the first row seen for each unique (ID, Name) pair.
    entries = set()
    for row in reader:
        key = (row['ID'], row['Name'])
        if key not in entries:
            writer.writerow(row)
            entries.add(key)
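To illustrate what goes wrong: keying the seen-set on the (ID, Name) pair keeps exactly one row per unique pair, so duplicate pairs vanish while distinct pairs under the same ID survive. A toy run (inline data, illustration only, not my real file):

entries = set()
kept = []
for row in [{'ID': '1', 'Name': 'A'}, {'ID': '1', 'Name': 'A'},
            {'ID': '2', 'Name': 'A'}, {'ID': '2', 'Name': 'B'}]:
    key = (row['ID'], row['Name'])
    if key not in entries:
        kept.append(row)
        entries.add(key)

print(kept)
# One ID=1 row and *both* ID=2 rows remain: the opposite of what I want.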
Matthijs

1 Answer


If the rows are sorted by ID, you can use the following code.

import csv
import itertools
import operator

filename = 'testing.csv'
outfile = 'outfile.csv'
ndups = 0

with open(filename, 'r') as fin, open(outfile, 'w') as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, lineterminator='\n', fieldnames=reader.fieldnames)
    writer.writeheader()
    # Group consecutive rows sharing an ID; drop the whole group
    # (counting its rows) when it carries more than one distinct Name.
    for id_, grp in itertools.groupby(reader, key=operator.itemgetter('ID')):
        rows = list(grp)
        if len({row['Name'] for row in rows}) > 1:
            ndups += len(rows)
            continue
        writer.writerows(rows)

print('{} duplicates.'.format(ndups))
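Note that itertools.groupby only groups consecutive items with equal keys, which is why the input must be sorted by ID. A quick demonstration:

import itertools

data = ['1', '1', '2', '1']
# groupby merges only *consecutive* equal items, so the trailing '1'
# starts a new group instead of joining the first one.
print([(key, len(list(grp))) for key, grp in itertools.groupby(data)])
# -> [('1', 2), ('2', 1), ('1', 1)]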
falsetru
  • Hi falsetru, it looks like this actually works! In terms of printing the number of duplicates I had something else in mind (simply returning the total number of duplicates), but I'll try to factor that in myself. – Matthijs Jun 21 '13 at 11:41
  • Actually, I'm not completely sure the rows are always sorted by ID. Would you have a solution for this also? – Matthijs Jun 21 '13 at 11:49
  • @Matthijs, How large is the csv file? – falsetru Jun 21 '13 at 12:00
  • Replace the `for...` line with `for id_, grp in itertools.groupby(sorted(reader, key=operator.itemgetter('ID')), key=operator.itemgetter('ID')):` and it will work for unsorted data too (see the full sketch after these comments). – falsetru Jun 21 '13 at 12:07
  • Awesome! Thanks falsetru. The CSV file could be millions of rows actually. – Matthijs Jun 21 '13 at 12:40
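For unsorted data, here is a sketch assembled from falsetru's comment above (assumption: the whole file fits in memory, since sorted() materialises every row before grouping):

import csv
import itertools
import operator

filename = 'testing.csv'
outfile = 'outfile.csv'
ndups = 0
get_id = operator.itemgetter('ID')

with open(filename, 'r') as fin, open(outfile, 'w') as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, lineterminator='\n', fieldnames=reader.fieldnames)
    writer.writeheader()
    # Sort by ID first so groupby sees each ID as one consecutive run.
    # sorted() holds all rows in memory; a file with millions of rows
    # may call for an external sort or a two-pass count instead.
    for id_, grp in itertools.groupby(sorted(reader, key=get_id), key=get_id):
        rows = list(grp)
        if len({row['Name'] for row in rows}) > 1:
            ndups += len(rows)
            continue
        writer.writerows(rows)

print('{} duplicates.'.format(ndups))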