How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times.
The recommended method (1, 2, 3, 4) is either to use s.replace(d) or, occasionally, to use s.map(d) if all your series values are found in the dictionary keys.
However, s.replace can be unreasonably slow, often 5-10x slower than a simple list comprehension.
The alternative, s.map(d), has good performance but is only recommended when all series values are found in the dictionary keys.
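To make the restriction on s.map concrete, here is a minimal sketch of how the two methods treat values that are missing from the dictionary. The small series and the partial mapping are hypothetical, chosen only for illustration, and the fillna fallback is one possible workaround rather than a recommended method.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
d = {1: 10, 2: 20}  # hypothetical partial mapping: 3 and 4 are not keys

# replace leaves values without a matching key unchanged
replaced = s.replace(d)

# map turns values without a matching key into NaN (and the dtype becomes float)
mapped = s.map(d)

# one possible workaround: fall back to the original values, then restore the dtype
mapped_filled = s.map(d).fillna(s).astype(s.dtype)

print(replaced.tolist())
print(mapped.tolist())
print(mapped_filled.tolist())
```

Note that the fallback assumes the dictionary never maps a key to NaN on purpose, since such entries would also be overwritten by the original values.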
Why is s.replace so slow, and how can performance be improved?
import numpy as np
import pandas as pd

# One million random integers in [0, 1000)
df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()

##### TEST 1: every series value is a dictionary key #####
d = {i: i+1 for i in range(1000)}
%timeit df['A'].replace(d)           # 1.98s
%timeit [d[i] for i in lst]          # 134ms

##### TEST 2: only values 0-9 are dictionary keys #####
d = {i: i+1 for i in range(10)}
%timeit df['A'].replace(d)           # 20.1ms
%timeit [d.get(i, i) for i in lst]   # 243ms
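The timings above are only comparable if both approaches produce the same output. A minimal sketch of that check, runnable outside IPython, is shown here; the fixed seed, the smaller sample size, and the names d_full and d_part are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed is an assumption, for repeatability
s = pd.Series(rng.integers(0, 1000, 10_000))  # smaller sample than the benchmark
lst = s.tolist()

d_full = {i: i + 1 for i in range(1000)}  # every value is a key (as in TEST 1)
d_part = {i: i + 1 for i in range(10)}    # only values 0-9 are keys (as in TEST 2)

# compare s.replace against the corresponding list comprehension
full_ok = s.replace(d_full).tolist() == [d_full[i] for i in lst]
part_ok = s.replace(d_part).tolist() == [d_part.get(i, i) for i in lst]

print(full_ok, part_ok)
```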
Note: This question is not marked as a duplicate because it is looking for specific advice on when to use different methods given different datasets. This is explicit in the answer and is an aspect not usually addressed in other questions.