I know this is long settled, but I had a closely related problem: I needed to remove duplicates based on one column. The input CSV file was too large to open on my PC with MS Excel/LibreOffice Calc/Google Sheets: 147 MB with about 2.5 million records. Since I did not want to install a whole external library for such a simple thing, I wrote the Python script below, which did the job in under 5 minutes. I didn't focus on optimization, but I believe it could be made to run faster and more efficiently for even bigger files. The algorithm is similar to @IcyFlame's above, except that I remove duplicates based on a single column ('CCC') instead of the whole row/line.
import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'w', newline='') as outfile:
    # this set will hold the unique CCC numbers already written out
    ccc_numbers = set()
    # read the input rows as dictionaries keyed by the header; the infile contained some null bytes
    results = csv.DictReader(infile)
    writer = csv.writer(outfile)
    # write column headers to the output file
    writer.writerow(
        ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
    )
    for result in results:
        ccc_number = result.get('CCC')
        # if the value has been seen already, skip writing the whole row to the output file
        if ccc_number in ccc_numbers:
            continue
        writer.writerow([
            result.get('ID'),
            ccc_number,
            result.get('MFLCode'),
            result.get('DateCollected'),
            result.get('DateTested'),
            result.get('Result'),
            result.get('Justification')
        ])
        # remember the value so later rows with the same CCC are skipped
        ccc_numbers.add(ccc_number)
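
If you don't want to hard-code the column names, a minimal variant (a sketch, assuming the dedup key is still 'CCC' and you are on Python 3, where DictReader rows keep the column order) can reuse the reader's fieldnames so every column of the input is carried through unchanged:

    import csv

    with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'w', newline='') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.writer(outfile)
        # reuse the input header instead of listing the columns by hand
        writer.writerow(reader.fieldnames)
        seen = set()
        for row in reader:
            key = row.get('CCC')  # change this to dedupe on a different column
            if key in seen:
                continue
            writer.writerow(row[name] for name in reader.fieldnames)
            seen.add(key)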