Recently, I observed that pandas is faster than numpy at element-wise multiplication; the example below shows this. How is that possible for such a simple operation, given that the data underlying a pandas dataframe is stored in numpy arrays?

Measurements

I use an array and a dataframe, each of shape (10000, 10000).

import numpy as np
import pandas as pd

a = np.random.randn(10000, 10000)
d = pd.DataFrame(a.copy())
a.shape
(10000, 10000)
d.shape
(10000, 10000)
%%timeit
d * d
53.2 ms ± 333 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
a * a
318 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Observations

pandas is about six times faster than numpy at evaluating this simple multiplication (53.2 ms versus 318 ms). How can this be?

tripleee
thomas
    Does this answer your question? [Numpy / Pandas optimized vector operations](https://stackoverflow.com/questions/55303847/numpy-pandas-optimized-vector-operations) – Joe Jun 17 '20 at 08:38
  • https://stackoverflow.com/questions/17390886/how-to-speed-up-pandas-multilevel-dataframe-sum – Joe Jun 17 '20 at 08:39

1 Answer


Pandas uses numexpr behind the scenes

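pandas uses numexpr under the hood for element-wise operations on large arrays when numexpr is installed, which is the case in my environment. Calling numexpr explicitly on the numpy array reproduces the pandas timing, as the measurement below shows.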
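To check whether this applies on a given machine, one can test whether numexpr is importable and look at the pandas option `compute.use_numexpr`. The snippet below is a minimal sketch, not taken from the original post: the array size (1000, 1000) and the variable names are illustrative assumptions, chosen so that it runs quickly.

```python
import importlib.util

import numpy as np
import pandas as pd

# Is numexpr importable in this environment? (pandas only uses it if so.)
numexpr_installed = importlib.util.find_spec("numexpr") is not None
print("numexpr installed:", numexpr_installed)

# pandas exposes a switch for the numexpr code path.
print("compute.use_numexpr:", pd.get_option("compute.use_numexpr"))

# Illustrative size only; the question uses (10000, 10000).
d = pd.DataFrame(np.random.randn(1000, 1000))

# Temporarily disable numexpr to compare against the plain numpy path.
with pd.option_context("compute.use_numexpr", False):
    plain = d * d

accelerated = d * d

# Both code paths must produce the same values.
print(np.allclose(plain.to_numpy(), accelerated.to_numpy()))
```

Timing `d * d` inside and outside the `option_context` block (for example with `%timeit`) should show whether numexpr accounts for the difference on your setup.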

Measurement

numexpr.evaluate takes a numerical expression as a string and evaluates it on numpy.ndarrays.

import numexpr
%%timeit
numexpr.evaluate('a * a')
52.7 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Observations

The wall time for multiplying the array by itself is now roughly the same as the time pandas needs (52.7 ms versus 53.2 ms).
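Outside of IPython, the same comparison can be sketched with the standard `timeit` module. This is an illustration under assumptions, not the original benchmark: the array is smaller (2000, 2000) so it runs quickly, the repeat count of 5 is arbitrary, and the variables are passed to numexpr explicitly through `local_dict` rather than relying on frame lookup. If numexpr is not installed, the sketch only reports the numpy timing.

```python
import timeit

import numpy as np

try:
    import numexpr
except ImportError:  # numexpr is optional; fall back to numpy only
    numexpr = None

# Illustrative size only; the question uses (10000, 10000).
a = np.random.randn(2000, 2000)
repeats = 5

# Plain numpy element-wise multiplication.
t_numpy = timeit.timeit(lambda: a * a, number=repeats) / repeats
print(f"numpy:   {t_numpy * 1e3:.2f} ms per call")

if numexpr is not None:
    # Pass the operand explicitly so the expression does not depend
    # on which frame evaluate() is called from.
    def run_numexpr():
        return numexpr.evaluate("a * a", local_dict={"a": a})

    product = run_numexpr()
    t_numexpr = timeit.timeit(run_numexpr, number=repeats) / repeats
    print(f"numexpr: {t_numexpr * 1e3:.2f} ms per call")
else:
    product = a * a
    t_numexpr = None

# Whichever path was taken, the values agree with numpy.
print(np.allclose(product, a * a))
```

The actual speedup depends on array size, core count, and memory bandwidth, so the numbers will differ from the ones measured above.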

Conclusion

There can be cases where pandas is faster than numpy alone. On the other hand, using numexpr together with numpy gives the same speedup, but you have to call it yourself. Additionally, this is not a typical use case for pandas. Usually a dataframe has an Index or a MultiIndex (hierarchical index) attached to at least one axis. How multiplication performs when the two dataframes have different MultiIndexes (so that the labels must be aligned first), for example, still needs to be investigated.
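To make the point about indexes concrete, here is a small sketch of what pandas does in addition to the raw multiplication when the labels differ. The frames `left` and `right` and their values are made up for illustration; it shows behaviour only and makes no claim about performance.

```python
import pandas as pd

# Two hypothetical frames whose indexes only partly overlap.
left = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=["b", "c", "d"])

# pandas aligns on labels first: the result covers the union of both
# indexes, and labels present in only one frame yield NaN.
product = left * right
print(product)

# numpy multiplies purely by position and ignores the labels.
positional = left["x"].to_numpy() * right["x"].to_numpy()
print(positional)
```

Here `product` has the index a, b, c, d with NaN at a and d, while `positional` is simply the element-by-element product of the two value arrays. The alignment step is extra work that the benchmark above, with identical default indexes, does not exercise.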

thomas