5

I have a CSV file with multiple entries. Example csv:

user, phone, email
joe, 123, joe@x.com
mary, 456, mary@x.com
ed, 123, ed@x.com

I'm trying to remove duplicates based on a specific column of the CSV, but with the code below I'm getting a "list index out of range" error. I thought that by comparing row[1] with newrows[1] I would find all the duplicates and write only the unique entries to file2.csv. This doesn't work, though, and I can't understand why.

f1 = csv.reader(open('file1.csv', 'rb'))
newrows = []
for row in f1:
    if row[1] not in newrows[1]:
        newrows.append(row)
writer = csv.writer(open("file2.csv", "wb"))
writer.writerows(newrows)

My desired end result is a list that maintains the sequence of the file (a set won't work... right?), which should look like this:

user, phone, email
joe, 123, joe@x.com
mary, 456, mary@x.com
serk
  • I think a database would be really useful here. Python works OOB with SQLite, you know? – NullUserException Oct 07 '11 at 03:37
  • When something doesn't work, you always need to describe what did happen. Is it an error? Is it nothing? Is it the wrong thing? – Winston Ewert Oct 07 '11 at 03:38
  • I did... my code currently produces a list index out of range. This doesn't make sense as it is searching the second "column" with row[1]. – serk Oct 07 '11 at 03:40
  • @serk, my bad missed that. Still, you should generally include the complete error along with line number and traceback. – Winston Ewert Oct 07 '11 at 03:42

3 Answers

9

row[1] refers to the second column in the current row (phone). That's all well and good.

However, with newrows.append(row) you add the entire row to the list.

The immediate "list index out of range" error comes from newrows[1]: on the first pass through the loop, newrows is still empty, so there is no element at index 1. Beyond that, even checking row[1] in newrows would compare an individual phone number against a list of complete rows, and that's not what you want to do. You need to check against a collection of just the phone numbers. For that, keep a set of the phone numbers observed so far.

Something like:

import csv

f1 = csv.reader(open('file1.csv', 'rb'))
writer = csv.writer(open("file2.csv", "wb"))
phone_numbers = set()                  # phone numbers seen so far
for row in f1:
    if row[1] not in phone_numbers:    # first time this phone number appears
        writer.writerow(row)           # keep the row
        phone_numbers.add(row[1])      # remember the number so later duplicates are skipped
Winston Ewert
  • This worked. Thanks! I thought `set` would not work because it would not keep the order of the output correct. Would this still work if I used the `newrows` list and just modified the `append(row)` to `writerow`? – serk Oct 07 '11 at 04:03
  • @serk, set works because I don't ever care about the order. I only ever check whether stuff is in it. Yes, appending to the list will work just fine (see the sketch of that variant after these comments). – Winston Ewert Oct 07 '11 at 04:10
  • By the way, if you are in fact removing duplicate phone numbers for, say, a telemarketing database, you risk thinning out your call list this way. Also, phone numbers that appear later in the reader file will at times be **newer** and more accurate than the same phone numbers you encountered earlier, since those earlier entries were written to the file earlier in time. – DevPlayer Oct 07 '11 at 05:56
  • `write.writerow(row)` should be `writer.writerow(row)` – Learner Oct 29 '15 at 07:04
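
Following up on the comment exchange above, here is a minimal sketch of the list-based variant, kept in the same Python 2 style ('rb'/'wb' modes) and with the same file names used in the thread; the list preserves the original row order, while the set is used only for the membership test:

import csv

f1 = csv.reader(open('file1.csv', 'rb'))
newrows = []                    # unique rows, in the order they were read
phone_numbers = set()           # phone numbers seen so far
for row in f1:
    if row[1] not in phone_numbers:
        newrows.append(row)     # keep only the first row for each phone number
        phone_numbers.add(row[1])

writer = csv.writer(open("file2.csv", "wb"))
writer.writerows(newrows)       # write the kept rows in their original order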
0

This solution worked for me, but since I'm new to Python, can someone explain the last bit:

writer.writerow(row)
phone_numbers.add( row[1] )

I understand that writer writes each row to the file "file2.csv". What does phone_numbers.add( row[1] ) do to eliminate duplicates?
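
A minimal sketch of what that last line is doing, using made-up numbers rather than the CSV from the question: the set remembers each phone number the first time it appears, so the not in test is false for any later row carrying the same number, and that row is simply skipped.

phone_numbers = set()
unique = []
for number in ['123', '456', '123']:      # made-up sample values
    if number not in phone_numbers:       # true only the first time a number appears
        unique.append(number)
        phone_numbers.add(number)         # remember it so later duplicates are skipped
# unique is now ['123', '456'] -- the second '123' was filtered out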

0

I know this is long gone, but in case someone else lands here, please find a better answer here.

Felix Ongati