There are several possible approaches to this problem.
Most people (on SO as well) agree that using a dict is the right way.
steveb here, for example. :D
Some would argue that a set() would be a more convenient and natural choice, but most tests I have seen and run myself show that, for some reason, using a dict() is slightly faster. I don't know of a definitive explanation for this, and the result may differ from one Python version to another.
Dictionaries and sets use hashing to access data, which makes membership tests faster than with lists (O(1) on average). To check whether an item is in a list, Python iterates over the list, so in the worst case the number of comparisons grows linearly with the length of the list (O(n)).
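If you want to see the difference yourself, here is a rough sketch with timeit. The sample size, the fake extension strings and the helper name time_lookup are all made up for illustration, and the absolute numbers will depend on your machine and Python version:

```python
import timeit

# Hypothetical sample data: 10000 distinct fake "extensions".
n = 10000
items = [".ext%d" % i for i in range(n)]

as_list = list(items)
as_set = set(items)
as_dict = dict.fromkeys(items)

# Probe for the last element: the worst case for the list scan.
probe = items[-1]

def time_lookup(container, repeats=1000):
    """Return the total time in seconds for `repeats` membership tests."""
    return timeit.timeit(lambda: probe in container, number=repeats)

list_time = time_lookup(as_list)
set_time = time_lookup(as_set)
dict_time = time_lookup(as_dict)

print("list: %.6f s" % list_time)
print("set:  %.6f s" % set_time)
print("dict: %.6f s" % dict_time)
```

On the runs I would expect, the list is slower by orders of magnitude, while set and dict are close to each other.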
To learn more on the subject, I suggest you examine the related questions, especially the one mentioned as a possible duplicate.
So, I agree with steveb and propose the following code:
import os
from os import path

chkdict = {}  # A dictionary that we'll use to check whether an extension has already been processed
setdef = chkdict.setdefault  # Binding the method to a local name avoids repeated attribute lookups, which may improve performance a little
# x is assumed to be the output file you opened earlier
# Recurse through a directory:
for root, dirs, files in os.walk("ymir work"):
    # Loop through all files in the currently examined directory:
    for file in files:
        ext = path.splitext(file)  # Split the file name into (name, extension)
        # If the file has no extension (this includes names like ".bashrc" or ".ds_store"), ignore it; otherwise lowercase the extension:
        if ext[0] and ext[1]: ext = ext[1].lower()
        else: continue
        if ext not in chkdict:
            # D.setdefault(k[, d]) does: D.get(k, d), also set D[k] = d if k not in D
            # You decide whether to use my method with dict.setdefault(k, k)
            # Or you can write ext separately and then do: chkdict[ext] = None
            # The second solution might even be faster, as setdefault() will check for existence again
            # But to be certain you should run the timeit test
            x.write("\t\"%s\"\n" % setdef(ext, ext))
            #x.write("\t\"%s\"\n" % ext)
            #chkdict[ext] = None
del chkdict  # If you're not inside a function, better to free the memory as soon as you can (if you don't need the data stored there any longer)
I use this algorithm on large amounts of data and it performs very well.
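As for the timeit test mentioned in the comments, a minimal sketch could look like this. The function names and the sample data are hypothetical (50 distinct extensions repeated many times), and I collect results in a list instead of writing to a file so that only the bookkeeping is measured. A set() variant is included for comparison:

```python
import timeit

# Hypothetical sample data: 10000 entries, only 50 distinct extensions.
extensions = [".%03d" % (i % 50) for i in range(10000)]

def with_setdefault(exts):
    """Record new extensions via dict.setdefault(k, k)."""
    seen = {}
    setdef = seen.setdefault
    out = []
    for ext in exts:
        if ext not in seen:
            out.append(setdef(ext, ext))
    return out

def with_assignment(exts):
    """Record new extensions via a plain assignment: seen[ext] = None."""
    seen = {}
    out = []
    for ext in exts:
        if ext not in seen:
            out.append(ext)
            seen[ext] = None
    return out

def with_set(exts):
    """Same idea, but using a set instead of a dict."""
    seen = set()
    add = seen.add
    out = []
    for ext in exts:
        if ext not in seen:
            out.append(ext)
            add(ext)
    return out

for func in (with_setdefault, with_assignment, with_set):
    elapsed = timeit.timeit(lambda: func(extensions), number=100)
    print("%-16s %.4f s" % (func.__name__, elapsed))
```

All three return the same list of unique extensions in first-seen order, so pick whichever is fastest on your Python version.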