
Say I need to collect millions of strings in an iterable that I can later randomly index by position.

I need to populate the iterable one item at a time, sequentially, for millions of entries.

Given the above, which method could in principle be more efficient:

Populating a list:

while <condition>:
    if <condition>:
        my_list[count] = value
        count += 1

Populating a dictionary:

while <condition>:
    if <condition>:
        my_dict[count] = value
        count += 1

(The above is pseudocode; everything would be initialized before running the snippets.)
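
For concreteness, a minimal runnable version of the two snippets might look like this, with a hypothetical items() generator standing in for the real data source and the <condition> checks folded into the loop:

def items():
    # hypothetical stand-in for the real source of strings
    for i in range(1000000):
        yield str(i)

# List version: pre-sized so that assignment by index is valid.
my_list = [None] * 1000000
count = 0
for value in items():
    my_list[count] = value
    count += 1

# Dict version: sequential integer keys.
my_dict = {}
count = 0
for value in items():
    my_dict[count] = value
    count += 1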

I am specifically interested in the CPython implementation for Python 3.4.

Amelio Vazquez-Reina
  • I would expect any difference to be negligible. Given that you don't have a sparse index, the list makes more sense conceptually, but both list indexing and dictionary key lookup are `O(1)` (see e.g. https://wiki.python.org/moin/TimeComplexity). Consider a list comp (`[value for value in ]`). – jonrsharpe Oct 15 '14 at 15:51
  • I'm assuming you are pre-sizing the list in your first example – EdChum Oct 15 '14 at 15:52
  • Thanks @jonrsharpe. My concern is populating them, not accessing them. – Amelio Vazquez-Reina Oct 15 '14 at 15:52
  • @EdChum, the strings are variable length, although I know the number of strings. How should I pre-size it? – Amelio Vazquez-Reina Oct 15 '14 at 15:53
  • Then the same resource tells you that adding to lists and dictionaries are also both `O(1)`. – jonrsharpe Oct 15 '14 at 15:53
  • Hmm... I thought there was a `resize()` method, but there isn't; spending too long in C++ land: http://stackoverflow.com/questions/311775/python-create-a-list-with-initial-capacity/1602958#1602958 – EdChum Oct 15 '14 at 15:57
  • @jonrsharpe: and when two algorithms have the same complexity, it's time to start the real engineering :p – Fred Foo Oct 15 '14 at 16:09
  • @EdChum `[None] * length` is effectively a way of preallocating space in a list. – Dunes Oct 15 '14 at 16:53
  • @Dunes yes I saw that in the link I gave, for some reason I thought there might be a built in method – EdChum Oct 15 '14 at 16:56
  • They may both be O(1), but the "constant time" of the dict is probably going to be bigger, since at very least it has to apply a hashing function and be able to handle collisions. Also, if your access pattern is sequential you may gain something with a list, since it's more cache friendly (although you are storing pointers anyway, so I wouldn't expect this to weigh much). – Matteo Italia Oct 15 '14 at 16:56
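
As a rough illustration of the constant-factor point in the last comment, one can compare the container overhead directly (a sketch; exact byte counts vary by CPython version and platform):

import sys

n = 1000000
values = [str(i) for i in range(n)]
keyed = dict(enumerate(values))

# Both containers hold references to the same string objects;
# getsizeof measures only the container structure itself.
print(sys.getsizeof(values))  # list: a contiguous array of pointers
print(sys.getsizeof(keyed))   # dict: a hash table, with extra per-slot overhead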

3 Answers


Lists are definitely faster if you use them the right way.

In [19]: %%timeit l = []
   ....: for i in range(1000000): l.append(str(i))
   ....: 
1 loops, best of 3: 182 ms per loop

In [20]: %%timeit d = {}
   ....: for i in range(1000000): d[i] = str(i)
   ....: 
1 loops, best of 3: 207 ms per loop

In [21]: %timeit [str(i) for i in range(1000000)]
10 loops, best of 3: 158 ms per loop

Pushing the Python loop down to the C level with a comprehension buys you quite a bit of time. A list is also the more natural choice when the keys are simply the consecutive integers 0, 1, 2, .... Pre-allocating saves even more time:

>>> %%timeit
... l = [None] * 1000000
... for i in xrange(1000000): l[i] = str(i)
... 
10 loops, best of 3: 147 ms per loop

For completeness, a dict comprehension does not speed things up:

In [22]: %timeit {i: str(i) for i in range(1000000)}
1 loops, best of 3: 213 ms per loop

With larger strings, I see very similar differences in performance (try `str(i) * 10`). This is CPython 2.7.6 on an x86-64.
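
Since the question targets CPython 3.4 specifically, here is a sketch of the same comparison as a plain Python 3 script using the `timeit` module (timings omitted; they vary by machine and interpreter build):

import timeit

N = 10 ** 6

def fill_list_append():
    l = []
    for i in range(N):
        l.append(str(i))

def fill_list_prealloc():
    # pre-size the list, then assign by index; no resizing during the loop
    l = [None] * N
    for i in range(N):
        l[i] = str(i)

def fill_dict():
    d = {}
    for i in range(N):
        d[i] = str(i)

for fn in (fill_list_append, fill_list_prealloc, fill_dict):
    # best of 3 runs of one pass each, mirroring the %%timeit output above
    print(fn.__name__, min(timeit.repeat(fn, number=1, repeat=3)))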

Fred Foo
  • If we're doing list comprehensions, shouldn't we also be doing dictionary comprehensions (for completeness, if nothing else)? – NPE Oct 15 '14 at 16:07
  • When you mentioned there isn't a way to preallocate lists, I would like to understand that better. Are you saying that `my_list = N_items * [None]` and then using `my_list[index] = value` does not save me time w.r.t. `my_list.append(value)`? – Amelio Vazquez-Reina Oct 15 '14 at 16:55
  • @user815423426 I completely forgot about that option. Yes, it is faster still. – Fred Foo Oct 15 '14 at 17:12

I don't understand why you want to create an empty list or dict and then populate it. Why not create a new list or dictionary directly from the generation process?

results = list(a_generator)

# Or if you really want to use a dict for some reason:
results = dict(enumerate(a_generator))
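
For example, with a hypothetical generator standing in for the real data source:

def make_strings():
    # hypothetical stand-in for whatever actually produces the strings
    for i in range(1000000):
        yield str(i)

a_generator = make_strings()
results = list(a_generator)
print(results[42])            # random access by position: '42'

a_generator = make_strings()  # generators are single-use, so make a fresh one
results_d = dict(enumerate(a_generator))
print(results_d[42])          # the same lookup through integer keys: '42'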
Karl Knechtel
  • Thanks. I know I can use generators. My question is about when I choose to/have to populate the container one item at a time, for millions of items, but I certainly appreciate the observation. – Amelio Vazquez-Reina Oct 15 '14 at 16:52
  • @user815423426 Why do you have to populate your container one at a time? Having looked at your question, you could easily use a generator to create those values. – Dunes Oct 15 '14 at 16:56

You can get even better times by using the `map` function, which, like a comprehension, runs the loop at C level:

>>> def test1():
    l = []
    for i in range(10 ** 6):
        l.append(str(i))

>>> def test2():
    d = {}
    for i in range(10 ** 6):
        d[i] = str(i)

>>> def test3():
    [str(i) for i in range(10 ** 6)]

>>> def test4():
    {i: str(i) for i in range(10 ** 6)}

>>> def test5():
    list(map(str, range(10 ** 6)))

>>> def test6():
    r = range(10 ** 6)
    dict(zip(r, map(str, r)))

>>> timeit.Timer('test1()', 'from __main__ import test1').timeit(100)
30.628035710889932
>>> timeit.Timer('test2()', 'from __main__ import test2').timeit(100)
31.093550469839613
>>> timeit.Timer('test3()', 'from __main__ import test3').timeit(100)
25.778271498509355
>>> timeit.Timer('test4()', 'from __main__ import test4').timeit(100)
30.10892986559668
>>> timeit.Timer('test5()', 'from __main__ import test5').timeit(100)
20.633583353028826
>>> timeit.Timer('test6()', 'from __main__ import test6').timeit(100)
28.660790917067914
Noctis Skytower