
Good evening. I have an Excel file with zip codes and associated information. Those zip codes have a lot of duplicates. I'd like to figure out which zip codes I have by putting them all in a list without duplicates. This code works, but it runs very slowly (it took over 100 seconds), and I was wondering what I could do to improve its efficiency.

I know that having to check the whole list for duplicates each time contributes a lot to the inefficiency, but I'm not sure how to fix that. I also know that going through every row is probably not the best approach, but again, I'm pretty new and am now stuck.

Thanks in advance.

import sys
import xlrd

loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)

def findUniqueZips():
    zipsInSheet = []
    for i in range(sheet.nrows):
        if str(sheet.cell(i,0).value) in zipsInSheet:
            pass
        else:
            zipsInSheet.append(str(sheet.cell(i,0).value))
    print(zipsInSheet)

findUniqueZips()

2 Answers


If you're looking to avoid duplicates, then you should definitely consider using sets in Python. See here.

What I would do is create a set and simply add all your elements to it; note that a set is an unordered collection of unique items. Once all the data has been added, you can then copy the elements of the set into your list. This avoids redundant data.



import sys
import xlrd

loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)

def findUniqueZips():
    zipsInSheet = []
    data = set()

    # Add every zip code to the set; duplicates are discarded automatically
    for i in range(sheet.nrows):
        data.add(str(sheet.cell(i, 0).value))

    # Now copy the unique elements of the set into your list
    for zipCode in data:
        zipsInSheet.append(zipCode)
    print(zipsInSheet)

findUniqueZips()
AzyCrw4282

I usually just convert it to a set. Sets are your friend: membership tests on a set are much faster than on a list. Unless you intentionally need or want duplicates, use sets.

https://docs.python.org/3.7/tutorial/datastructures.html?highlight=intersection#sets
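For example, here is a minimal sketch of that approach, assuming the same xlrd workbook layout as in the question (the file path is a placeholder):

import xlrd

wb = xlrd.open_workbook("locationOfFile")  # placeholder path, as in the question
sheet = wb.sheet_by_index(0)

# Read the zip code column, then convert it to a set to drop duplicates
zip_codes = [str(sheet.cell(i, 0).value) for i in range(sheet.nrows)]
unique_zips = set(zip_codes)
print(unique_zips)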

rotten
  • Yes, if your values are hashable, sets are excellent for deduplicating a list. They do not preserve order though - sometimes that's a disadvantage, but there's a way to work around that which is still fast (see the sketch after these comments). – dstromberg Feb 18 '20 at 23:11
  • There is an ordered sets module: https://pypi.org/project/ordered-set/ However, I agree you can't use sets for everything, but often you can. For very large sets vs. very large lists, they can be more than an order of magnitude faster when you use the "in" operator to check whether an item is in their contents (even ordered sets are faster than list lookups). – rotten Feb 19 '20 at 01:05
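A common workaround for the ordering point raised in the comments above (just a sketch, not from either answer) is dict.fromkeys, which relies on dictionaries preserving insertion order in Python 3.7+ while still giving constant-time duplicate checks:

# Deduplicate while preserving first-seen order (Python 3.7+)
zip_codes = ["10001", "10002", "10001", "10003", "10002"]
unique_in_order = list(dict.fromkeys(zip_codes))
print(unique_in_order)  # ['10001', '10002', '10003']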