[Using Python 3] I have a CSV file that I want to read and remove a 'special' case of duplicates from. The script should write the deduplicated data to a new CSV, keeping the header row.
It's best to explain by example. The CSV file looks something like this:
ID   Name   HeaderX   HeaderY   HeaderZ   ...
1    A      string    float     string    ...
1    A      string    float     string    ...
1    A      string    float     string    ...
2    A      string    float     string    ...
2    B      string    float     string    ...
3    A      string    float     string    ...
4    B      string    float     string    ...
5    C      string    float     string    ...
6    D      string    float     string    ...
...  ...    ...       ...       ...       ...
Here there are duplicate rows for ID=1 and ID=2, but I only want to keep duplicates where the Name values are all the same. So in this example I want to keep all rows with ID=1 and remove all rows with ID=2. In other words: remove every row whose ID is associated with more than one distinct Name. (Does that make sense?)
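Just to make the rule concrete, this is roughly how I picture it (hand-written from the example above, not code from my script): map each ID to the set of Names it appears with, and keep only the IDs whose set has exactly one element.

# Illustration only: this dict is filled in by hand from the example data above.
names_per_id = {
    '1': {'A'},       # one Name  -> keep all rows with ID=1
    '2': {'A', 'B'},  # two Names -> remove all rows with ID=2
    '3': {'A'},
    '4': {'B'},
    '5': {'C'},
    '6': {'D'},
}
ids_to_keep = {i for i, names in names_per_id.items() if len(names) == 1}
print(ids_to_keep)  # contains '1', '3', '4', '5', '6' (everything except '2')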
Currently I have the following code (below), based on this thread. However, it does exactly the opposite: it dedupes on the (ID, Name) pair, so it keeps one row for each Name of ID=2 and collapses ID=1 down to a single row.
Also, ideally I would like the script to print the count of duplicates it removed.
import csv

filename = 'testing.csv'
outfile = 'outfile.csv'

with open(outfile, 'w') as fout:
    writer = None
    entries = set()
    with open(filename, 'r') as fin:
        reader = csv.DictReader(fin)
        if not writer:
            writer = csv.DictWriter(fout, lineterminator='\n', fieldnames=reader.fieldnames)
            writer.writeheader()
        for row in reader:
            # keeps only the first occurrence of each (ID, Name) pair
            key = (row['ID'], row['Name'])
            if key not in entries:
                writer.writerow(row)
                entries.add(key)
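For what it's worth, I suspect this needs two passes over the file: a first pass to find which IDs map to more than one Name, and a second pass that writes only the remaining rows and counts what was dropped. The untested sketch below is roughly what I have in mind (names_per_id, bad_ids and removed are just placeholder names), but I'm not sure it is the cleanest way:

import csv
from collections import defaultdict

filename = 'testing.csv'
outfile = 'outfile.csv'

# Pass 1: collect every Name seen for each ID.
names_per_id = defaultdict(set)
with open(filename, 'r', newline='') as fin:
    for row in csv.DictReader(fin):
        names_per_id[row['ID']].add(row['Name'])

# IDs whose rows should all be removed: more than one distinct Name.
bad_ids = {i for i, names in names_per_id.items() if len(names) > 1}

# Pass 2: write only rows whose ID is not flagged, counting what gets dropped.
removed = 0
with open(filename, 'r', newline='') as fin, open(outfile, 'w', newline='') as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, lineterminator='\n', fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row['ID'] in bad_ids:
            removed += 1
        else:
            writer.writerow(row)

print('Removed', removed, 'rows')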