5

I have a list of about 50 strings. I will repeatedly (potentially tens of thousands of times) need to know the position of items in the list. Is it better to use list.index() each time, or create a dictionary mapping each item to its position? (My instinct says to create the dictionary, but I don't know what underlies the list indexing, and it may be superfluous.)

djvg
  • 11,722
  • 5
  • 72
  • 103
PurpleVermont
  • 1,179
  • 4
  • 18
  • 46

3 Answers3

9

list.index() will traverse the list until it finds the item it's looking for, which is a linear-time operation. Looking a string up in a dictionary, by contrast, is a constant-time operation, so the dictionary approach will likely have better performance.

Since your keys are strings and you have relatively few of them, another data structure you might want to explore is the trie.

arshajii
  • 127,459
  • 24
  • 238
  • 287
  • Hmm, I think with a small number of (not especially short) strings, a trie might be slower. My strings will be in the 5-50 character range, so if worst case for a trie is the length of the string, that's as bad as using list.index() if I have at most 50 items in the list. – PurpleVermont Jan 20 '15 at 03:16
  • @PurpleVermont It really depends on your strings. I would expect `list.index()` to be far slower than a trie in general since the former is linear in the *number of strings* whereas the latter is linear in the *string length* (note that each "operation" w.r.t. `list.index()` consists of comparing strings, which is itself linear in the string length). – arshajii Jan 20 '15 at 03:20
  • But a hash should generally be faster than both, I'd think – PurpleVermont Jan 20 '15 at 03:23
  • 1
    @PurpleVermont Again, it depends on your data. There are cases where a trie would be faster, and there are cases where a hash table would be faster. In any case, I doubt a dictionary will not be fast enough for your needs -- the trie idea is just something to think about and something that could help future visitors who are facing a similar problem. – arshajii Jan 20 '15 at 03:38
4

Use a dictionary mapping as opposed to looking up an item in a list. The dictionary mapping uses the hash of each item before it evaluates. The hash comparison is much faster and can be found much more quickly (in constant time) as opposed to seeking through the list and evaluating item by item (which scales in linear time).

You can profile your lookups like this:

import timeit
setup = 'from __main__ import foo_dict, foo_list'

To restrict the comparison for a list only 50 long:

l = list(str(i) for i in range(50))
d = dict((str(i), i) for i in range(50))
def foo_dict(k):
    return d[k]

def foo_list(k):
    return l.index(k)

timeit.repeat('[foo_dict(str(i)) for i in range(50)]', setup)

returns for me:

[20.89474606513977, 23.206938982009888, 22.23725199699402]

and

timeit.repeat('[foo_list(str(i)) for i in range(50)]', setup)

returns:

[47.33547496795654, 47.995683908462524, 46.79590392112732]

The dict lookup is much faster for the string because it uses a hash table, whereas the list lookup for the index is much slower because it has to evaluate each string in it against the string being looked for.

Russia Must Remove Putin
  • 374,368
  • 89
  • 403
  • 331
  • Probably not as big a difference when the list is only 50 elements long of course – PurpleVermont Jan 20 '15 at 03:22
  • Thanks, I accpeted arshajii's answer because it was the first to confirm that in theory the dictionary would be faster, because the list.index operation would be linear time while the dictionary lookup would be constant time, but I do appreciate your "proving" that it is indeed faster in practice. – PurpleVermont Jan 20 '15 at 16:00
  • @PurpleVermont first isn't always best, you want to consider what other people will come across in the future. e.g. I just got this accept this morning: http://stackoverflow.com/questions/19017994/limit-truncate-text-file-by-number-of-lines/27672002#27672002 check out the dates of the answers. – Russia Must Remove Putin Jan 20 '15 at 16:04
  • It was a tough choice since both answer the question - one theoretically and one practically. – PurpleVermont Jan 20 '15 at 16:06
  • @PurpleVermont I have fleshed out my response on the theory. – Russia Must Remove Putin Jan 20 '15 at 16:13
2

The dictionary will be much faster, and it's very fast to create, too:

indexer = dict((v, i) for i, v in enumerate(thelist))

enumerate yields (i, thelist[i]) for i in range(len(thelist)), whence the generator expression "swapping" the tuple (as you need to map content to index, not vice versa).

Note that this will work only if every list item is hashable, but since you say the items are strings, you should be fine.

dict, among other things, rapidly turns an iterable of (key, value) tuples into the corresponding dictionary.

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
  • 1
    It's odd to see `dict` used like that these days - most people would go for a dict-comp :) – Jon Clements Jan 20 '15 at 03:08
  • nice trick on indexing the list. (I'm not accepting this answer because it doesn't apply in my context where both the list and dictionary will be simultaneously built up over time, one entry at a time, although that wasn't stated in the question.) – PurpleVermont Jan 20 '15 at 03:10
  • If mutations during the "build up" are by append or extend, then updating the dictionary is also easy. If the "build up" can change the index of existing list items (e.g by inserting in random places), maintaining the `dict` can be more trouble than it's worth -- depends on the patterns of mutations and accesses! Please edit your Q to reflect your **actual** specs with thoroughness and precision, otherwise you essentially make it impossible for us to help you. – Alex Martelli Jan 20 '15 at 03:34
  • Actually I got the answer(s) I was looking for, which was which would be quicker for lookup. Building is easy in my case and I don't need any advice for that, which is why it is not part of the question. – PurpleVermont Jan 20 '15 at 03:48