Remove duplicate rows from CSV

Question

I have a CSV file that looks like this:

red,75,right
red,344,right
green,3,center
yellow,3222,right
blue,9,center
black,123,left
white,68,right
green,47,left
purple,48,left
purple,988,right
pink,2677,left
white,34,right

I am using Python and am trying to remove rows that have a duplicate value in the first cell. I know I can achieve this with something like pandas, but I am trying to do it using the standard Python csv library.

The expected result is:

red,75,right
green,3,center
yellow,3222,right
blue,9,center
black,123,left
white,68,right
purple,988,right
pink,2677,left

Does anyone have an example?

I am removing the pandas tag since you don't want a pandas solution. — ayhan, Aug 04 '16 at 18:03

Alexander · Accepted Answer · 2016-08-05T02:30:20.870

2

You can simply use a dictionary where the color is the key and the value is the row. Ignore the color if it is already in the dictionary, otherwise add it and write the row to a new csv file.

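import csv

file_in = 'input_file.csv'
file_out = 'output_file.csv'
# Python 2: the csv module expects files opened in binary mode ('rb'/'wb').
# On Python 3, open both files in text mode instead and pass newline=''.
with open(file_in, 'rb') as fin, open(file_out, 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    d = {}  # maps each color to the first row seen with that color
    for row in reader:
        color = row[0]
        if color not in d:
            d[color] = row
            writer.writerow(row)
result = d.values()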

result
# Output:
# [['blue', '9', 'center'],
# ['pink', '2677', 'left'],
# ['purple', '48', 'left'],
# ['yellow', '3222', 'right'],
# ['black', '123', 'left'],
# ['green', '3', 'center'],
# ['white', '68', 'right'],
# ['red', '75', 'right']]

And the output of the csv file:

!cat output_file.csv
# Output:
# red,75,right
# green,3,center
# yellow,3222,right
# blue,9,center
# black,123,left
# white,68,right
# purple,48,left
# pink,2677,left

edited Aug 05 '16 at 02:30

answered Aug 04 '16 at 18:27


My original question wasn't very clear, I have updated it with the expected output now – fightstarr20 Aug 04 '16 at 20:51
That works great! How would I output the result as a CSV? – fightstarr20 Aug 04 '16 at 21:35
I am getting "iterator should return strings, not bytes". – fightstarr20 Aug 04 '16 at 22:51
This works fine for me using Python 2.7.11. Which version are you using? – Alexander Aug 05 '16 at 02:28
All working now. I needed to open both files in text mode instead of binary, and I also added newline='' to the output file to stop an extra blank line being added to the output. – fightstarr20 Aug 05 '16 at 12:01
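
For readers on Python 3, the fix described in the last comment can be sketched roughly as follows. This is only a sketch under a few assumptions: the function name `dedupe_first_column` is made up for illustration, a set replaces the dictionary because only the keys are needed, and empty lines are skipped.

```python
import csv


def dedupe_first_column(file_in, file_out):
    """Copy file_in to file_out, keeping only the first row for each
    distinct value in the first column (hypothetical helper name)."""
    seen = set()
    # Python 3: open both files in text mode; newline='' lets the csv
    # module handle line endings itself and avoids extra blank lines.
    with open(file_in, newline='') as fin, \
            open(file_out, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            if not row:  # assumption: skip completely empty lines
                continue
            key = row[0]
            if key not in seen:
                seen.add(key)
                writer.writerow(row)


# Example call, using the file names from the answer above:
# dedupe_first_column('input_file.csv', 'output_file.csv')
```

Rows are written in the order they are first seen, so the output order matches the input order.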

score 0 · Answer 2 · answered Aug 04 '16 at 18:13

You can try this:

import fileinput
import sys

def main():
    seen = set()  # set for fast O(1) amortized lookup

    # inplace=True redirects standard output into '1.csv' while iterating
    for line in fileinput.FileInput('1.csv', inplace=True):
        cell_1 = line.split(',')[0]
        if cell_1 not in seen:
            seen.add(cell_1)
            # line still ends with its newline; this works on Python 2 and 3
            sys.stdout.write(line)

if __name__ == '__main__':
    main()
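
One caveat with splitting each line on ',' is that it assumes the first field never contains a quoted comma. A possible variant, sketched here for Python 3, parses rows with the csv module and replaces the file through a temporary file in the same directory. The name `dedupe_in_place` is hypothetical, and swapping `fileinput` for a temporary file is a change of technique from the answer above, not part of it.

```python
import csv
import os
import tempfile


def dedupe_in_place(path):
    """Rewrite the CSV at `path`, keeping the first row for each distinct
    first-column value (hypothetical helper, not from the answer)."""
    seen = set()
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file next to the original, then swap it in.
    with open(path, newline='') as fin, tempfile.NamedTemporaryFile(
            'w', newline='', dir=directory, delete=False) as tmp:
        writer = csv.writer(tmp)
        for row in csv.reader(fin):
            # csv.reader handles quoted fields, so "red, dark" is one key
            if row and row[0] not in seen:
                seen.add(row[0])
                writer.writerow(row)
    # Both files are closed here; replace the original with the result.
    os.replace(tmp.name, path)


# Example call, using the file name from the answer above:
# dedupe_in_place('1.csv')
```

Unlike the `fileinput` version, the original file is only replaced after the whole pass has finished.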
