You really want a set
. Sets are faster than lists because they can only contain unique elements, which allows them to be implemented as hash tables. Hash tables allow membership testing (if element in my_set
) in O(1)
time. This contrasts with lists, where the only way to check if an element is in the list is to check every element of the list in turn (in O(n)
time.)
A dict
is similar to a set
in that both allow unique keys only, and both are implemented as hash tables. They both allow O(1)
membership testing. The difference is that a set
only has keys, while a dict
has both keys and values (which is extra overhead you don't need in this application.)
Using a set
, and replacing the nested for loop with an itertools.chain()
to flatten the 2D list to a 1D list:
import itertools
seen = set()
for author in itertools.chain(*authors):
seen.add(author)
Or shorter:
import itertools
seen = set( itertools.chain(*authors) )
Edit (thanks, @jamylak) more memory efficient for large lists:
import itertools
seen = set( itertools.chain.from_iterable(authors) )
Example on a list of lists:
>>> a = [[1,2],[1,2],[1,2],[3,4]]
>>> set ( itertools.chain(*a) )
set([1, 2, 3, 4])
P.S. : If, instead of finding all the unique authors, you want to count the number of times you see each author, use a collections.Counter
, a special kind of dictionary optimised for counting things.
Here's an example of counting characters in a string:
>>> a = "DEADBEEF CAFEBABE"
>>> import collections
>>> collections.Counter(a)
Counter({'E': 5, 'A': 3, 'B': 3, 'D': 2, 'F': 2, ' ': 1, 'C': 1})