
In the pandas manual, there is this example about indexing:

In [653]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [654]: df2[criterion]

Then Wes wrote:

# equivalent but slower
In [655]: df2[[x.startswith('t') for x in df2['a']]]
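For anyone who wants to try the two forms side by side, here is a minimal sketch. The contents of df2 are made up for illustration (the manual's actual df2 is not shown here); the only point is that both expressions select the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the manual's df2; the values are invented.
df2 = pd.DataFrame({"a": ["two", "one", "three", "four"], "b": range(4)})

# Form 1: build a boolean Series with map, then index with it.
criterion = df2["a"].map(lambda x: x.startswith("t"))
by_map = df2[criterion]

# Form 2: build a plain Python list of booleans, then index with it.
by_list = df2[[x.startswith("t") for x in df2["a"]]]

# Both forms pick out the rows whose 'a' value starts with 't'.
print(by_map.equals(by_list))
```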

Can anyone here explain why the map approach is faster? Is this a Python feature or a pandas feature?

James Bond

1 Answer


Arguments about why a certain way of doing things in Python "should be" faster can't be taken too seriously, because you're often measuring implementation details which may behave differently in certain situations. As a result, when people guess what should be faster, they're often (usually?) wrong. For example, I find that map can actually be slower. Using this setup code:

import numpy as np, pandas as pd
import random, string

def make_test(num, width):
    s = [''.join(random.sample(string.ascii_lowercase, width)) for i in range(num)]
    df = pd.DataFrame({"a": s})
    return df

Let's compare the time each approach takes to build the indexing object (whether a Series or a list) and then the time it takes to use that object to index into the DataFrame. It could be, for example, that making a list is fast, but before it can be used as an index it has to be converted internally to a Series or an ndarray or something, which adds extra time.
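Outside IPython, the same split can be sketched with the standard timeit module. This is only an illustration of the idea, not the code behind the numbers in this answer: the builders dict, the frame size of 1000, and the repeat counts are arbitrary choices of mine, and the absolute timings will differ by machine and pandas version.

```python
import random
import string
import timeit

import pandas as pd


def make_test(num, width):
    # Same helper as in the setup code: random lowercase strings.
    s = ["".join(random.sample(string.ascii_lowercase, width)) for _ in range(num)]
    return pd.DataFrame({"a": s})


df = make_test(1000, 10)

# Three ways to build the indexing object (names are illustrative).
builders = {
    "map": lambda: df["a"].map(lambda x: x.startswith("t")),
    "listcomp": lambda: [x.startswith("t") for x in df["a"]],
    "str.startswith": lambda: df["a"].str.startswith("t"),
}

number = 20  # calls per timing run, kept small so this finishes quickly
results = {}
for name, build in builders.items():
    mask = build()  # prebuilt mask, so indexing is timed on its own
    build_time = min(timeit.repeat(build, number=number, repeat=3)) / number
    index_time = min(timeit.repeat(lambda: df[mask], number=number, repeat=3)) / number
    results[name] = (build_time, index_time)
    print(f"{name:>15}: build {build_time * 1e6:8.1f} µs, index {index_time * 1e6:8.1f} µs")
```

Timing the build step and the indexing step separately shows where any conversion cost lands, which a single combined timing would hide.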

First, for a small frame:

>>> df = make_test(10, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10000 loops, best of 3: 85.8 µs per loop
>>> %timeit [x.startswith('t') for x in df['a']]
100000 loops, best of 3: 15.6 µs per loop
>>> %timeit df['a'].str.startswith("t")
10000 loops, best of 3: 118 µs per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 304 µs per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10000 loops, best of 3: 194 µs per loop
>>> %timeit df[df['a'].str.startswith("t")]
1000 loops, best of 3: 348 µs per loop

and in this case the listcomp is fastest. That doesn't actually surprise me too much, to be honest, because going via a lambda is likely to be slower than calling the string's startswith method directly, but it's really hard to guess. Ten rows is few enough that we're probably still measuring things like setup costs for Series; what happens in a larger frame?

>>> df = make_test(10**5, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 46.6 ms per loop
>>> %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 27.8 ms per loop
>>> %timeit df['a'].str.startswith("t")
10 loops, best of 3: 48.5 ms per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
10 loops, best of 3: 47.1 ms per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10 loops, best of 3: 52.8 ms per loop
>>> %timeit df[df['a'].str.startswith("t")]
10 loops, best of 3: 49.6 ms per loop

And now it seems like the map is winning when used as an index, although the difference is marginal. But not so fast: what if we manually turn the listcomp into an array or a Series?

>>> %timeit df[np.array([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 40.7 ms per loop
>>> %timeit df[pd.Series([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 37.5 ms per loop

and now the listcomp wins again!
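To make the candidates concrete, here is a small sketch of the four indexing objects being compared. The four-row frame and the variable names are invented for illustration. It only shows that they are different types selecting the same rows; it says nothing about what pandas does internally with each one.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; the real timings above used make_test.
df = pd.DataFrame({"a": ["tiger", "lion", "trout", "bear"]})

as_list = [x.startswith("t") for x in df["a"]]       # plain Python list
as_array = np.array(as_list)                         # NumPy boolean array
as_series = pd.Series(as_list)                       # Series with a default index
as_map = df["a"].map(lambda x: x.startswith("t"))    # Series built by map

for label, mask in [("list", as_list), ("ndarray", as_array),
                    ("Series", as_series), ("map", as_map)]:
    # Different container types, same selected rows.
    print(label, type(mask).__name__, list(df[mask]["a"]))
```

Since every variant selects the same rows, any timing difference between them comes from how the mask is built and handled, not from the result.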

Conclusion: who knows? But never believe anything without timeit results, and even then you have to ask whether you're testing what you think you are.

DSM
  • can you submit this as a PR for docs : https://github.com/pydata/pandas/issues/3871, trying to create a new section – Jeff Sep 21 '13 at 14:25
  • This is probably also the section of the docs where Wes states that [startswith is slower than slicing](http://stackoverflow.com/questions/13270888/why-is-startswith-slower-than-slicing)! – Andy Hayden Sep 21 '13 at 15:20