Should I use dict or list?

Question

I would like to loop through a big two-dimensional list:

authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"], ... ]

and get a list that contains all the names that occur in authors.

When I loop through the list, I need a container to store the names I've already seen. I'm wondering whether I should use a list or a dict:

with a list:

seen = []
for author_list in authors:
    for author in author_list:
        if author not in seen:
            seen.append(author)
result = seen

with a dict:

seen = {}
for author_list in authors:
    for author in author_list:
        if author not in seen:
            seen[author] = True
result = seen.keys()

Which one is faster? Or are there better solutions?

Why don't you profile and/or time it to see which one is faster? — Niek de Klein, May 10 '12 at 08:08
making it a `set` ought to make it faster than a list for lookups. It also ought to use less memory than a dict. But don't take my word for it, try it out. — Mattias Nilsson, May 10 '12 at 08:10
If you also care about the performance of lookups, lookups in lists are O(n), while lookups in dictionaries are amortised O(1).. more info here: http://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table — Thanasis Petsas, May 10 '12 at 08:11
@NiekdeKlein I'm not just looking for the result, but also the analytics of why one is faster than another — wong2, May 10 '12 at 08:14
@thg435 : or any of the million `result = set(itertools.chain.from_iterable(authors))` answers below. ;) — Li-aung Yip, May 10 '12 at 09:01
@Li-aungYip: `itertools` is fantastic, but totally confusing to beginners. — georg, May 10 '12 at 09:05
@thg435: Personally I find chained list comprehensions a bit hard to follow too. :) — Li-aung Yip, May 10 '12 at 09:11

Li-aung Yip · Accepted Answer · 2012-05-10T08:38:55.853

You really want a set. Sets are faster than lists for this because they are implemented as hash tables (which is also why they can only contain unique, hashable elements). Hash tables allow membership testing (if element in my_set) in O(1) time on average. This contrasts with lists, where the only way to check if an element is in the list is to check every element of the list in turn (in O(n) time).

A dict is similar to a set in that both allow unique keys only, and both are implemented as hash tables. They both allow O(1) membership testing. The difference is that a set only has keys, while a dict has both keys and values (which is extra overhead you don't need in this application.)
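The O(n) versus O(1) difference described above can be checked with a rough timing sketch. The data here is made up for illustration (10,000 hypothetical names), and the absolute numbers will vary by machine; only the relative gap is of interest.

```python
import timeit

# Made-up data: 10,000 distinct names (an assumption for illustration only)
names = ["author%d" % i for i in range(10000)]
as_list = list(names)
as_set = set(names)

# Worst case for the list: the target is the last element
target = "author9999"

# Run each membership test 1000 times and compare total time
list_time = timeit.timeit(lambda: target in as_list, number=1000)
set_time = timeit.timeit(lambda: target in as_set, number=1000)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

The list test has to scan up to 10,000 elements per lookup, while the set test hashes the string and probes the table directly.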

Using a set, and replacing the nested for loop with an itertools.chain() to flatten the 2D list to a 1D list:

import itertools
seen = set()
for author in itertools.chain(*authors):
    seen.add(author)

Or shorter:

import itertools
seen = set( itertools.chain(*authors) )

Edit (thanks, @jamylak): this is more memory efficient for large lists:

import itertools
seen = set( itertools.chain.from_iterable(authors) )

Example on a list of lists:

>>> a = [[1,2],[1,2],[1,2],[3,4]]
>>> set ( itertools.chain(*a) )
set([1, 2, 3, 4])
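A small sketch of the difference between the two spellings, assuming the author lists come from a generator (the `author_lists` function is hypothetical, standing in for a large data source). `chain(*x)` has to unpack the whole outer iterable into arguments before it starts, while `chain.from_iterable(x)` pulls one inner list at a time. Both produce the same set.

```python
import itertools

def author_lists():
    # Hypothetical generator standing in for a large data source
    yield ["Bob", "Lisa"]
    yield ["Alice", "Bob"]
    yield ["Molly", "Jim"]

# Unpacking consumes the whole generator up front to build the argument tuple
unpacked = set(itertools.chain(*author_lists()))

# from_iterable consumes the generator lazily, one inner list at a time
lazy = set(itertools.chain.from_iterable(author_lists()))

print(unpacked == lazy)
```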

P.S.: If, instead of finding all the unique authors, you want to count the number of times you see each author, use a collections.Counter, a special kind of dictionary optimised for counting things.

Here's an example of counting characters in a string:

>>> a = "DEADBEEF CAFEBABE"
>>> import collections
>>> collections.Counter(a)
Counter({'E': 5, 'A': 3, 'B': 3, 'D': 2, 'F': 2, ' ': 1, 'C': 1})
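Applied to the authors list from the question, the same idea might look like the sketch below. It reuses the three sample sublists from the question and flattens them with `itertools.chain.from_iterable` before counting.

```python
import collections
import itertools

# Sample data taken from the question
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]

# Flatten the nested lists and count each name in one pass
counts = collections.Counter(itertools.chain.from_iterable(authors))

print(counts["Bob"])          # 2
print(counts.most_common(1))  # [('Bob', 2)]
```

The keys of the Counter are the unique authors, so `set(counts)` gives the same result as the set-based versions above.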

@jamylak: Yes, I always forget about `from_iterable()`. The `*a` unpacking syntax comes more naturally to me (though `from_iterable()` is lazy and thus probably uses less memory / is faster.) — Li-aung Yip, May 10 '12 at 08:29

score 3 · Answer 2 · answered May 10 '12 at 08:15

set should be faster.

>>> authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]
>>> from itertools import chain
>>> set(chain(*authors))
set(['Lisa', 'Bob', 'Jim', 'Molly', 'Alice'])

mshsayem


score 3 · Answer 3 · answered May 10 '12 at 08:15

Using a dict or a set is way faster than using a list:

import itertools
result = set(itertools.chain.from_iterable(authors))

mata


I always forget about `from_iterable(a)` (I say `*a` instead.) – Li-aung Yip May 10 '12 at 08:20

score 2 · Answer 4 · answered May 10 '12 at 08:13

You can use set -

from sets import Set

seen = Set()

for author_list in authors:
    for author in author_list:
        seen.add(author)

result = seen

This way you avoid the explicit "if" check, so the solution should be faster.
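A minimal sketch of the same loop using the built-in `set` type instead of `sets.Set` (a substitution made here because the `sets` module does not exist in Python 3). The sample data is taken from the question. Adding an element that is already present simply leaves the set unchanged, which is why no "if" is needed.

```python
# Sample data taken from the question
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]

seen = set()  # built-in set, no import required

for author_list in authors:
    for author in author_list:
        # No membership check: adding a name that is already present is a no-op
        seen.add(author)

result = seen
print(len(result))  # 5
```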

theharshest


What is the benefit of the import? Why wouldn't you use the builtin 'set' itself? – sateesh May 10 '12 at 08:17
`set` is a native data type in Python 2.6 and up. [The `sets` module is deprecated.](http://docs.python.org/library/sets.html) – Li-aung Yip May 10 '12 at 08:18

score 1 · Answer 5 · edited May 23 '17 at 11:52

If you care about the performance of lookups, lookups in lists are O(n), while lookups in dictionaries are amortised O(1).

You can find more information here: http://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table


answered May 10 '12 at 08:16

Thanasis Petsas


score 1 · Answer 6 · answered May 10 '12 at 08:35

Lists just store a bunch of items in a particular order. Think of your list of authors as a long line of pigeonhole boxes with authors' names on bits of paper in the boxes. The names stay in the order you put them in, and you can find the author in any particular pigeonhole very easily, but if you want to know if a particular author is in any pigeonhole, then you have to look through each one until you find the name you're after. You can also have the same name in any number of pigeonholes.

Dictionaries are a bit more like a phone book. Given the author's name, you can very quickly check to see whether the author is listed in the phone book, and find the phone number listed with it. But you can only include each author once (with exactly one phone number), and you can't put the authors in there in any order you like; they have to be in the order that makes sense for the phone book. In a real phone book, that order is alphabetical; in Python dictionaries the order has nothing to do with the keys' values (in older versions it was arbitrary and could change when you added or removed things; since Python 3.7 it is insertion order), but Python can find entries even faster in a dictionary than it could in a phone book.

Sets, on the other hand, are like phone books that just list names, not phone numbers. You still can't have the same name included more than once, it's either in the set or not. And you still can't use the order in which names are in the set for anything useful. But you can very quickly check whether a name is in the set.

Given your use case, a set would appear to be the obvious choice. You don't care about the order in which you've seen authors, or how many times you've seen each author, only that you can quickly check whether you've seen a particular author before.

Lists are bad for this case; they go to the effort of remembering duplicates in whatever order you specify, and they're slow to search. But you also don't have any need to map keys to values, which is what a dictionary does. To go back to the phone book analogy, you don't have anything equivalent to a "phone number"; in your dictionary example you're doing the equivalent of writing a phone book in which everybody's number is listed as True, so why bother listing the phone numbers at all?

A set, OTOH, does exactly what you need.
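One caveat worth a sketch: the list version in the question happens to keep names in first-seen order, while a set does not. If that order ever did matter, one option (assuming Python 3.7 or later, where dicts keep insertion order) is `dict.fromkeys`, which still gives hash-table lookups. The sample data is taken from the question.

```python
import itertools

# Sample data taken from the question
authors = [["Bob", "Lisa"], ["Alice", "Bob"], ["Molly", "Jim"]]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
# (guaranteed on Python 3.7+; this is an assumption about the runtime)
ordered = list(dict.fromkeys(itertools.chain.from_iterable(authors)))

print(ordered)  # ['Bob', 'Lisa', 'Alice', 'Molly', 'Jim']
```

For the use case as stated, where order is irrelevant, the plain set remains the simpler choice.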

Why wouldn't you compare a `dict` to a *dictionary*? Upvoted anyway. :P — Li-aung Yip, May 10 '12 at 09:04
@Li-aungYip: Good question! I guess phone numbers just sprang to mind more readily as being value-ish than word definitions? Plus dictionaries often do have multiple entries for the one word... But really I'm just reaching for justifications here. — Ben, May 10 '12 at 13:12
