If my understanding is correct, pandas in-place operations involve calling an `.update_inplace()` method, so for example, `.replace()` would compute the new, replaced data first, then update the dataframe accordingly.

`.applymap()` is a wrapper of `.apply()`; neither of these comes with an inplace option, but even if they did, they would still need to store all the output data in memory before modifying the dataframe.

From the source, `.applymap()` calls `.apply()`, which calls `.aggregate()`, which calls `_aggregate()`, which calls `._agg()`, which is nothing more than a for loop run in Python (i.e. not Cython, I think).
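For intuition, the *effect* (though not the actual implementation) of `.applymap()` amounts to a plain Python loop over each column that collects all the output before building a brand-new dataframe; something like this hypothetical sketch (`applymap_sketch` is my own name, not a pandas function):

```python
import numpy as np
import pandas as pd

def applymap_sketch(df, func):
    # hypothetical stand-in for .applymap(): a Python-level loop over
    # each column, collecting every output value before constructing
    # the result -- nothing happens in place
    return pd.DataFrame(
        {col: [func(v) for v in df[col]] for col in df.columns},
        index=df.index,
    )

frame = pd.DataFrame(np.random.randn(5, 3))
result = applymap_sketch(frame, round)
```

This also makes it clear why an inplace variant would not save memory: all the applied values exist before the original data can be overwritten.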
You could, of course, modify the underlying NumPy array directly; the following code rounds the dataframe in place:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(100, 100))

# first method: set one element at a time
for i in frame.index:
    for j in frame.columns:
        val = round(frame.values[i, j])
        frame.values[i, j] = val

# second method: build each row in a buffer, then set the whole row
newvals = np.zeros(frame.shape[1])
for i in frame.index:
    for j in frame.columns:
        val = round(frame.values[i, j])
        newvals[j] = val
    frame.values[i] = newvals
```
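As an aside, for rounding specifically the loop can be avoided entirely: assuming a single-dtype frame, where `.values` exposes the underlying array as a view, NumPy can write the result straight back into that array.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(100, 100))
# round in place, vectorized: np.round's `out` argument writes the
# output directly into the array backing the dataframe
# (assumes a homogeneous float dtype, so .values is a view)
np.round(frame.values, out=frame.values)
```

This only works for functions NumPy can vectorize, of course; for an arbitrary Python `func`, you are back to looping.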
The first method sets one element at a time and takes about 1 s; the second sets a row at a time and takes about 100 ms; `.applymap(round)` does it in 20 ms. Interestingly, however, with `frame = pd.DataFrame(np.random.randn(1, 10000))`, both the first method and `.applymap(round)` take about 1.2 s, while the second takes about 100 ms. Finally, `frame = pd.DataFrame(np.random.randn(10000, 1))` has the first and second methods taking 1 s (unsurprisingly), while `.applymap(round)` takes 10 ms.
These results more or less show that `.applymap()` is essentially iterating over each column: of the three shapes tried, (10000, 1), (100, 100), and (1, 10000), the first was fastest and the third slowest. The following code does roughly the same thing as `.applymap()`, in place:
```python
newvals = np.zeros(frame.shape[1])
for i in frame.index:
    for j in frame.columns:
        val = round(frame.values[i, j])
        newvals[j] = val
    frame.values[i] = newvals
```
This one instead keeps a single reference to the underlying NumPy array (note that for a single-dtype frame, `.values` is a view, so writing into `arr` still modifies the dataframe):

```python
newvals = np.zeros(frame.shape[1])
arr = frame.values
for i in frame.index:
    for j in frame.columns:
        val = round(arr[i, j])
        newvals[j] = val
    arr[i] = newvals
```
With a 100x100 dataframe, the former took about 300 ms for me to run, and the latter 60 ms: the difference is due solely to looking up `.values` on the dataframe on every access!
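The cost of that repeated lookup is easy to demonstrate directly; here is a rough micro-benchmark sketch (the exact numbers will vary by machine and pandas version):

```python
import time

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(100, 100))

start = time.perf_counter()
for i in range(100):
    for j in range(100):
        _ = frame.values[i, j]  # .values is re-resolved on every access
slow = time.perf_counter() - start

arr = frame.values  # hoisted out of the loop, looked up once
start = time.perf_counter()
for i in range(100):
    for j in range(100):
        _ = arr[i, j]
fast = time.perf_counter() - start

print(f"repeated .values: {slow * 1000:.0f} ms, hoisted: {fast * 1000:.0f} ms")
```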
Running the latter in Cython takes about 34 ms, whereas `.applymap(round)` does it in 24 ms; I have no idea why `.applymap()` is still faster here, though.
To answer the question: there probably isn't an in-place implementation of `.applymap()`; if there were, it would most likely still involve storing all the 'applied' values before making the in-place change. If you want to do an `.applymap()` in place, you could just iterate over the underlying NumPy array, but this comes at a cost in performance. The best solution is likely to iterate over the rows or columns: e.g. assign `arr = df.values[i]`, apply the function to each element of `arr`, set `df.values[i] = arr`, and iterate over all `i`.
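As a concrete illustration, the row-wise recipe above could be sketched like this (`map_rows_inplace` is a hypothetical helper name of mine; it assumes a single-dtype frame, where `.values` exposes the underlying array as a view, so mutating the array mutates the dataframe):

```python
import numpy as np
import pandas as pd

def map_rows_inplace(arr, func):
    # apply func element-wise, one row at a time, writing the
    # results back into arr without allocating a full-size copy
    buf = np.empty(arr.shape[1])
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            buf[j] = func(arr[i, j])
        arr[i] = buf

frame = pd.DataFrame(np.random.randn(100, 100))
# mutates frame itself, since .values is a view for a float-only frame
map_rows_inplace(frame.values, round)
```

The only extra memory used is a single row-sized buffer, which is the best you can hope for given that the function's outputs for a row have to exist somewhere before they overwrite the inputs.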