Is it really that bad?
Below I wanted to show you how to make your code faster. But then I realized that it also depends on the size of the dataset in use. Nonetheless, let us first have a look at your problem. I will run the same code on my machine in order to make the comparison. I will do everything for a big dataset (100 times yours) and a small one (your dataset).
Pandas is slow on some numerical computations. Let's see how slow it is compared to the equivalent numpy operations.
Using pandas 0.23.4 on Linux with 32 cores, within a Jupyter notebook.
(The results at the end use pandas 1.0.4 on Windows with 2 cores, also within a Jupyter notebook.)
Note that all results were obtained within a Jupyter notebook with unchanged settings. Under real-world conditions the results might differ.
Measurements
Here are my measurements.
Big Dataset
import pandas as pd
import numpy as np
a = np.random.randn(10000, 4000)
df0 = pd.DataFrame(a.copy())
df = df0.copy()
Note that I use a bit more data: 100 times more. Additionally, I use the %%time magic command for measurement instead of %%timeit.
df.shape
(10000, 4000)
I run the following cell twice. The first time you run it, the kernel might still be loading libraries or compiling something, so it will show different results. But you can assume that no internal state is changed and no results are cached on the DataFrame when performing a simple multiplication (as happens when you perform a groupby and aggregation).
Furthermore, I do not create a copy in each cell as you did. Nonetheless, the following creates a new DataFrame and keeps the old one. It is not merely a view on the left-hand side's DataFrame.
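We can verify that claim with np.shares_memory (a quick sketch of mine, not part of the original timing runs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 40))
res = df * 1

# The result is a new DataFrame backed by freshly allocated memory,
# not a view on the left-hand side's data.
print(np.shares_memory(df.values, res.values))  # False
```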
%%time
_ = df * 1
CPU times: user 78 ms, sys: 90.6 ms, total: 169 ms
Wall time: 24.3 ms
If we assign the resulting DataFrame instance to the name df, the execution of the cell takes longer. Maybe because the garbage collector frees the DataFrame from the left-hand side: there is no reference left in the notebook to that one anymore. So be careful in your performance tests about what you are actually measuring!
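One way to see the old frame being freed is a weakref (a sketch of mine outside the notebook, assuming plain CPython reference counting):

```python
import gc
import weakref

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4))
old = weakref.ref(df)  # track the original frame without keeping it alive

df = df * 1            # rebind the name; no reference to the old frame remains
gc.collect()           # clean up any reference cycles as well

print(old() is None)   # the old DataFrame has been freed
```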
%%time
df = df * 1
CPU times: user 84.4 ms, sys: 94.7 ms, total: 179 ms
Wall time: 31.7 ms
Or with in-place multiplication:
%%time
df *= 1
CPU times: user 77.1 ms, sys: 97 ms, total: 174 ms
Wall time: 31 ms
Observations on the above: note that the total (CPU) time is higher than the wall time (what your wall clock, or nowadays your smartphone clock, measures). This tells us that some multiprocessing or concurrent multithreading is at work in the background.
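Outside a notebook you can make the same comparison yourself with time.perf_counter (wall time) and time.process_time (CPU time summed over all threads of the process) — a sketch, using a BLAS matrix multiplication as an operation that is typically multithreaded:

```python
import time

import numpy as np

a = np.random.randn(2000, 2000)

t_wall = time.perf_counter()
t_cpu = time.process_time()
_ = a @ a  # matrix multiplication: BLAS usually spreads it over several threads
wall = time.perf_counter() - t_wall
cpu = time.process_time() - t_cpu

# With a multithreaded backend, cpu can exceed wall, just like
# "total" exceeds "Wall time" in the %%time output above.
print(f"wall: {wall * 1e3:.1f} ms, cpu: {cpu * 1e3:.1f} ms")
```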
Let's continue now with how to make things faster. You basically tried the following:
%%time
df[:] = df.values * 1.
CPU times: user 258 ms, sys: 234 ms, total: 492 ms
Wall time: 491 ms
This is not faster, because __setitem__, which is quite sophisticated on a pandas.DataFrame, is slow. You get the same with loc:
%%time
df.loc[:] = df.values * 1.
CPU times: user 260 ms, sys: 224 ms, total: 485 ms
Wall time: 483 ms
Accessing the data directly
You can access the data directly and set the values. This seems to be faster. (But you might run into problems if you have mixed datatypes in the DataFrame.)
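The mixed-datatype caveat can be demonstrated directly: with mixed dtypes, .values has to build a consolidated object array, which is a copy, so operating on it does not reach the DataFrame's own data (a sketch of mine, not from the measurements):

```python
import numpy as np
import pandas as pd

# Homogeneous float frame: .values can hand back the underlying block.
df_num = pd.DataFrame(np.random.randn(5, 2))
print(df_num.values.dtype)  # float64

# Mixed dtypes: .values must consolidate into an object array -- a copy.
df_mix = pd.DataFrame({"x": [1.0, 2.0], "y": ["a", "b"]})
vals = df_mix.values
print(vals.dtype)  # object
print(np.shares_memory(vals, df_mix["x"].to_numpy()))  # False: a copy
```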
%%time
df.values[...] = df.values * 1.
CPU times: user 95.7 ms, sys: 78.5 ms, total: 174 ms
Wall time: 173 ms
Or, even faster, do everything in place (as long as df.values[...] returns a reference to the data store):
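Whether df.values really is a reference into the data store, and not a copy, is worth checking, since pandas does not guarantee it for every dtype layout. A quick sanity check with np.shares_memory (my addition):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 400))

# For a single-dtype frame, .values is typically a view on the block store:
# two calls hand back arrays over the same memory, not independent copies.
v1 = df.values
v2 = df.values
print(np.shares_memory(v1, v2))  # True here
```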
%%time
df.values[...] *= 1
CPU times: user 43.4 ms, sys: 0 ns, total: 43.4 ms
Wall time: 42.6 ms
Can it be faster than that? Let's compare this with the following multiplications. First by multiplying the initial dataset, the numpy array a ...
%%time
_ = a * 1
CPU times: user 45.9 ms, sys: 82.7 ms, total: 129 ms
Wall time: 128 ms
... and by performing the corresponding inplace multiplication.
%%time
a *= 1
CPU times: user 43.5 ms, sys: 0 ns, total: 43.5 ms
Wall time: 42.9 ms
This shows that less than about 43 milliseconds cannot be expected. Therefore, accessing the data directly and operating on it is as fast as operating on numpy arrays directly.
But note that in my example even the initial guess is faster than that, showing that some optimization takes place in pandas that does not in numpy. Strange!
Small Dataset
Here I make the same observations as you did. The trick of accessing the data directly works out best again (df.values[...] *= 1).
import numpy as np
import pandas as pd
a = np.random.randn(1000, 400)
df0 = pd.DataFrame(a.copy())
df = df0.copy()
df.shape
(1000, 400)
%%time
_ = df * 1
CPU times: user 4.23 ms, sys: 1.28 ms, total: 5.51 ms
Wall time: 2.83 ms
%%time
df = df * 1
CPU times: user 4.68 ms, sys: 188 µs, total: 4.87 ms
Wall time: 2.22 ms
%%time
df *= 1
CPU times: user 2.66 ms, sys: 1.76 ms, total: 4.42 ms
Wall time: 1.71 ms
%%time
df[:] = df.values * 1.
CPU times: user 4.28 ms, sys: 21 µs, total: 4.3 ms
Wall time: 3.51 ms
%%time
df.loc[:] = df.values * 1.
CPU times: user 3.77 ms, sys: 0 ns, total: 3.77 ms
Wall time: 3.13 ms
%%time
df.values[...] = df.values * 1.
CPU times: user 2.19 ms, sys: 0 ns, total: 2.19 ms
Wall time: 1.38 ms
%%time
df.values[...] *= 1
CPU times: user 211 µs, sys: 1.05 ms, total: 1.26 ms
Wall time: 681 µs
%%time
_ = a * 1
CPU times: user 1.61 ms, sys: 0 ns, total: 1.61 ms
Wall time: 818 µs
%%time
a *= 1
CPU times: user 379 µs, sys: 950 µs, total: 1.33 ms
Wall time: 671 µs
Open Questions
It looks like simple multiplications are sometimes faster with pandas than with numpy. Here for the big dataset from above:
%%timeit
_ = df * df
22.8 ms ± 590 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
_ = a * a
133 ms ± 4.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
It does not matter whether I use %%timeit or %%time; the results are the same.
%%time
_ = df * df
CPU times: user 62.3 ms, sys: 99.2 ms, total: 162 ms
Wall time: 23.8 ms
%%time
_ = a * a
CPU times: user 57.6 ms, sys: 82.3 ms, total: 140 ms
Wall time: 139 ms
I did not expect this. And you?
I cross-checked this on Windows 10 with 2 cores and pandas 1.0.4. The results look basically the same, although the relative differences are not as big anymore.
%%timeit
df * df
165 ms ± 5.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
a * a
251 ms ± 9.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
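A plausible explanation, which is an assumption of mine and not something I verified in the measurements above: pandas can route large elementwise expressions through numexpr, which is multithreaded, while a plain a * a in numpy runs single-threaded. You can toggle this path via the compute.use_numexpr option and re-time both variants:

```python
import numpy as np
import pandas as pd

a = np.random.randn(2000, 2000)
df = pd.DataFrame(a)

# Compute once with pandas' numexpr path disabled, once with the default.
pd.set_option("compute.use_numexpr", False)
res_plain = df * df
pd.reset_option("compute.use_numexpr")
res_default = df * df

# Both paths must, of course, agree on the result.
print(np.allclose(res_plain.to_numpy(), res_default.to_numpy()))  # True
```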