
Good evening. I have an Excel file with zip codes and associated information. Those zip codes have a lot of duplicates. I'd like to figure out which zip codes I have by putting them all in a list without duplicates. This code works, but it runs very slowly (it took over 100 seconds), and I was wondering what I could do to improve its efficiency.

I know that having to check the whole list for duplicates each time contributes a lot to the inefficiency, but I'm not sure how to fix that. I also know that going through every row is probably not the best approach, but again, I'm pretty new and am now stuck.

Thanks in advance.

import sys
import xlrd

loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)

def findUniqueZips():
    zipsInSheet = []
    for i in range(sheet.nrows):
        if str(sheet.cell(i,0).value) in zipsInSheet:
            pass
        else:
            zipsInSheet.append(str(sheet.cell(i,0).value))
    print(zipsInSheet)

findUniqueZips()

2 Answers


If you're looking to avoid duplicates, then you should definitely consider using sets in Python. See here.

What I would do is create a set and simply add all your elements to it; note that a set is an unordered collection of unique items. Once all the data has been added, you can then copy the elements of the set into your list. This avoids redundant data.



import sys
import xlrd

loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)

def findUniqueZips():
    zipsInSheet = []
    data = set()

    # Add every zip code to the set; duplicates are discarded automatically
    for i in range(sheet.nrows):
        data.add(str(sheet.cell(i, 0).value))

    # Now copy the unique elements of the set into your list
    for zipCode in data:
        zipsInSheet.append(zipCode)
    print(zipsInSheet)

findUniqueZips()
AzyCrw4282

I usually just convert it to a set. Sets are your friend: membership tests on a set are much faster than on a list. Unless you intentionally need or want duplicates, use sets.

https://docs.python.org/3.7/tutorial/datastructures.html?highlight=intersection#sets
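For example, here is a minimal sketch of that approach, assuming the same xlrd workbook layout as in the question (the file path is a placeholder):

import xlrd

wb = xlrd.open_workbook("locationOfFile")  # placeholder path, as in the question
sheet = wb.sheet_by_index(0)

# Read the zip code column, then convert it to a set to drop duplicates
zip_codes = [str(sheet.cell(i, 0).value) for i in range(sheet.nrows)]
unique_zips = set(zip_codes)
print(unique_zips)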

rotten
  • Yes, if your values are hashable, sets are excellent for deduplicating a list. They do not preserve order though - sometimes that's a disadvantage, but there's a way to work around that which is still fast (see the sketch after these comments). – dstromberg Feb 18 '20 at 23:11
  • There is an ordered sets module: https://pypi.org/project/ordered-set/ However, I agree you can't use sets for everything, but often you can. For very large sets vs. very large lists, they can be more than an order of magnitude faster when you use the "in" operator to check whether an item is in their contents (even ordered sets are faster than list lookups). – rotten Feb 19 '20 at 01:05
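A common workaround for the ordering point raised in the comments above (just a sketch, not from either answer) is dict.fromkeys, which relies on dictionaries preserving insertion order in Python 3.7+ while still giving constant-time duplicate checks:

# Deduplicate while preserving first-seen order (Python 3.7+)
zip_codes = ["10001", "10002", "10001", "10003", "10002"]
unique_in_order = list(dict.fromkeys(zip_codes))
print(unique_in_order)  # ['10001', '10002', '10003']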