
I compared several ways to access a single element of a DataFrame; the results are below. The quickest was the get_value method on the DataFrame, which I was pointed to in this post.

What surprised me is that access via get_value is quicker than access via the underlying numpy array, df.values.

Question

Is there a way to access elements of a numpy array as quickly as I can access a pandas DataFrame via get_value?

Setup

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4))

Testing

%%timeit
df.iloc[2, 2]

10000 loops, best of 3: 108 µs per loop

%%timeit
df.values[2, 2]

The slowest run took 5.42 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.02 µs per loop

%%timeit
df.iat[2, 2]

The slowest run took 4.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.85 µs per loop

%%timeit
df.get_value(2, 2)

The slowest run took 19.29 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.57 µs per loop

piRSquared
  • To whoever downvoted this question, I'd appreciate some feedback as to why. Thanks – piRSquared May 20 '16 at 01:23
  • If your bottleneck is single-element access, then you should be accessing more than one element at a time – Eric May 20 '16 at 04:28
  • Also, you may want to check that `x = df.values; %timeit x[2,2]` gives similar results - perhaps `values` is not an attribute but a `property`? – Eric May 20 '16 at 04:30

1 Answer

iloc is pretty general, accepting slices and lists as well as simple integers. In the case above, where you have simple integer indexing, pandas first checks that the key is a valid integer and then converts the request to an iat lookup, so it is bound to be much slower. iat eventually resolves to a call to get_value, so a direct call to get_value is naturally faster still. get_value itself is cached, so micro-benchmarks like these may not reflect performance in real code.

df.values does return an ndarray, but only after checking that it is a single contiguous block. This requires a few lookups and tests so it is a little slower than retrieving the value from the cache.
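As a rough sketch of what this means for the original question: if the per-access cost comes from the values lookup rather than from the ndarray indexing, fetching the array once and indexing it directly should avoid that cost. The snippet below uses the standard timeit module instead of the %timeit magic so it can run as a plain script. The names arr, n, and the t_* variables are illustrative, and the lambda wrappers add a small constant overhead to every measurement. It assumes a recent pandas, where get_value is no longer available, so only values and iat are timed.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Fetch the ndarray once, outside the timed statements, so the cost of
# the df.values lookup is not paid on every element access.
arr = df.values

n = 100_000  # number of calls per measurement (arbitrary choice)

t_property = timeit.timeit(lambda: df.values[2, 2], number=n)
t_hoisted = timeit.timeit(lambda: arr[2, 2], number=n)
t_iat = timeit.timeit(lambda: df.iat[2, 2], number=n)

for name, total in [
    ("df.values[2, 2]", t_property),
    ("arr[2, 2]", t_hoisted),
    ("df.iat[2, 2]", t_iat),
]:
    # Report the average cost of a single call in microseconds.
    print(f"{name:<16} {total / n * 1e6:.3f} µs per call")
```

The absolute numbers depend on the machine and on the pandas and numpy versions, so they are best read only relative to each other.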

We can defeat the caching by creating a new data frame every time. This shows that the values accessor is fastest, at least for data of a uniform type:

In [111]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4))
10000 loops, best of 3: 186 µs per loop

In [112]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.values[2,2]
1000 loops, best of 3: 200 µs per loop

In [113]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.get_value(2,2)
1000 loops, best of 3: 309 µs per loop

In [114]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iat[2,2]
1000 loops, best of 3: 308 µs per loop

In [115]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.iloc[2,2]
1000 loops, best of 3: 420 µs per loop

In [116]: %timeit df = pd.DataFrame(np.arange(16).reshape(4, 4)); df.ix[2,2]
1000 loops, best of 3: 316 µs per loop

The code claims that ix is the most general, and so it should in theory be slower than iloc; it may be that your particular test favours ix while other tests favour iloc, simply because of the order of the checks needed to identify the index as a scalar index.
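The same fresh-frame measurement can be sketched as a standalone script, assuming a recent pandas (get_value and ix were removed in pandas 1.0, so only values, iat, and iloc are compared here). The helper names make_frame and per_call, and the number and repeat settings, are illustrative choices, not part of the timings above. Construction cost is measured separately and subtracted, which gives only a rough estimate of the access cost; with this much noise a difference can occasionally come out near zero or negative.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame():
    """Build a fresh 4x4 frame so no per-frame cache survives between calls."""
    return pd.DataFrame(np.arange(16).reshape(4, 4))


# Accessors to compare; each takes a frame and returns element [2, 2].
accessors = {
    "values": lambda df: df.values[2, 2],
    "iat": lambda df: df.iat[2, 2],
    "iloc": lambda df: df.iloc[2, 2],
}


def per_call(func, number=2000, repeat=3):
    """Best-of-`repeat` average time for one call of `func`, in seconds."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# Baseline: the cost of constructing the frame alone.
baseline = per_call(make_frame)
print(f"construction only: {baseline * 1e6:.1f} µs")

results = {}
for name, access in accessors.items():
    # Time construction plus one access, then subtract the baseline.
    total = per_call(lambda: access(make_frame()))
    results[name] = total - baseline
    print(f"{name:<6} total {total * 1e6:.1f} µs, "
          f"access estimate {results[name] * 1e6:.1f} µs")
```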

Neapolitan
  • In addition to providing a very good answer, I wanted to thank you for providing good technique in decoupling caching from performance checking. Running a df creation as a benchmark followed by tests where df is created from scratch is perfect for determining the time to use each method for look up in live code. – piRSquared May 20 '16 at 06:57