
I have a CSV file with some data. I need to write it to a new CSV, but I can't have duplicate entries.

I have solved the writing part, but I haven't been able to solve the duplicate part. So far I have tried a nested loop, with no success.

This works, but produces duplicates:

import csv

with open('somefile.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    filewriter.writerow(['Data', 'MoreData', 'EvenMoreData'])

    # write the relevant columns of each row
    for row in rows:
        filewriter.writerow([row[3], row[4], row[2]])

Here is where everything goes wrong:

for row in rows:
    # compare each row against every other row
    for copy in rows:
        if row[3] != copy[3] and row[2] != copy[2]:
            filewriter.writerow([copy[3], copy[4], copy[2]])

OneCricketeer
Ehren
3 Answers


Using a set instead of a list will eliminate duplicates.

for row in set(rows):
  ...

In this case rows is probably a list of lists, and lists aren't hashable, so you'd need to convert each row to a tuple before building the set. It might also be in your interest to use set(row) if you want unique data within a row.
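A minimal sketch of that tuple conversion, with made-up sample rows standing in for the parsed CSV data:

```python
# Hypothetical rows standing in for the question's parsed CSV data.
rows = [
    ['a', 'b', 'c'],
    ['d', 'e', 'f'],
    ['a', 'b', 'c'],  # exact duplicate of the first row
]

# Lists are unhashable, so turn each row into a tuple before building the set.
unique_rows = set(map(tuple, rows))
# unique_rows now holds 2 entries
```

Note that a set doesn't preserve order; if the output order matters, `list(dict.fromkeys(map(tuple, rows)))` deduplicates while keeping first-seen order.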

Skarlett

You can use a set of tuples of keys (row[2] and row[3] in your case) to keep track of keys you have already seen:

seen = set()
for row in rows:
    if (row[2], row[3]) not in seen:
        seen.add((row[2], row[3]))
        filewriter.writerow([row[3], row[4], row[2]])
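Wired into the writing loop from the question, this looks like the following sketch; io.StringIO and the sample rows are stand-ins so it runs on its own:

```python
import csv
import io

# Stand-in rows; the middle one repeats columns 2 and 3 of the first.
rows = [
    ['x', 'x', 'key1', 'val1', 'extra1'],
    ['y', 'y', 'key1', 'val1', 'extra2'],  # duplicate key, gets skipped
    ['z', 'z', 'key2', 'val2', 'extra3'],
]

out = io.StringIO()  # stands in for the open CSV file
filewriter = csv.writer(out, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
filewriter.writerow(['Data', 'MoreData', 'EvenMoreData'])

seen = set()
for row in rows:
    if (row[2], row[3]) not in seen:
        seen.add((row[2], row[3]))
        filewriter.writerow([row[3], row[4], row[2]])
# out now contains the header plus two unique data rows
```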
blhsing

You can read it into pandas, drop the duplicates, then export a new CSV:

import pandas as pd

df = pd.read_csv('my_csv.csv')
df.drop_duplicates(inplace=True)  # default keep='first' retains one copy; keep=False would drop every copy
df.to_csv('my_csv_fixed.csv')

The above will also write pandas' index as an extra column. If you don't want that, read the CSV using the first column (0), or any column you like, as the index:

df = pd.read_csv('my_csv.csv', index_col=0)

Also, if you prefer tabs as the delimiter, export with the sep keyword argument:

df.to_csv('my_csv_fixed.csv', sep='\t')
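If only some columns determine what counts as a duplicate (as in the question, where row[2] and row[3] are the keys), drop_duplicates also takes a subset argument. This sketch uses made-up column names and in-memory data rather than a real file:

```python
import pandas as pd

# Hypothetical data mirroring the question's three output columns.
df = pd.DataFrame({
    'Data':         ['k1', 'k1', 'k2'],
    'MoreData':     ['v1', 'v1', 'v2'],
    'EvenMoreData': ['e1', 'e2', 'e3'],
})

# Treat rows as duplicates when they match on the key columns only;
# the first occurrence is kept by default.
deduped = df.drop_duplicates(subset=['Data', 'MoreData'])
```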
tzujan