update 2021 + speedtest
Starting from pandas 1.4, pandas.Series.str.removesuffix, the equivalent of Python's str.removesuffix, is implemented, so one can use
df['filename'].str.removesuffix('.txt')
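As a small illustration (the sample file names here are made up, and pandas 1.4 or later is assumed), the suffix is removed only where the string actually ends with it; other entries come back unchanged:

```python
import pandas as pd

# Hypothetical sample data: only two entries really end with ".txt".
df = pd.DataFrame({"filename": ["a.txt", "b.csv", "notes.txt.bak", "c.txt"]})

# Requires pandas >= 1.4; strings without the suffix are returned as they are.
cleaned = df["filename"].str.removesuffix(".txt")
print(cleaned.tolist())  # ['a', 'b.csv', 'notes.txt.bak', 'c']
```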
speed test
tl;dr: the fastest is
dat["fname"].map(lambda x: x[:-4] if x[-4:] == ".txt" else x)
In the speed test, I wanted to compare the different methods collected on this SO page. I excluded rstrip, because it would strip endings other than .txt too. Since the regexp is conditional (it only removes a trailing .txt), it is fair to modify the other functions as well, so that they remove the last 4 characters only if they are .txt.
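To show why rstrip is not a fair candidate, here is a minimal sketch in plain Python (the sample names are invented; pandas' Series.str.rstrip follows the same character-set semantics as str.rstrip). The argument is treated as a set of characters to strip, not as a suffix:

```python
# Hypothetical sample names; only the first two end with ".txt".
names = ["text.txt", "fn1.txt", "report"]

# rstrip removes any trailing run of the characters '.', 't' and 'x'.
stripped = [n.rstrip(".txt") for n in names]

# The conditional slice removes the last 4 characters only for a real ".txt" ending.
conditional = [n[:-4] if n.endswith(".txt") else n for n in names]

print(stripped)     # ['te', 'fn1', 'repor']
print(conditional)  # ['text', 'fn1', 'report']
```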
The testing code is
import pandas as pd
import time

ITER = 10  # number of timed repetitions per function and data size


def rm_re(dat: pd.DataFrame) -> pd.Series:
    """Use regular expression (dot escaped so it matches a literal '.')."""
    return dat["fname"].str.replace(r'\.txt$', '', regex=True)


def rm_map(dat: pd.DataFrame) -> pd.Series:
    """Use pandas map, find occurrences and remove with []"""
    where = dat["fname"].str.endswith(".txt")
    # Only the rows selected by `where` are overwritten (index-aligned).
    dat.loc[where, "fname"] = dat["fname"].map(lambda x: x[:-4])
    return dat["fname"]


def rm_map2(dat: pd.DataFrame) -> pd.Series:
    """Use pandas map with lambda conditional."""
    return dat["fname"].map(lambda x: x[:-4] if x[-4:] == ".txt" else x)


def rm_apply_str_suffix(dat: pd.DataFrame) -> pd.Series:
    """Use str method suffix with pandas apply (needs Python 3.9+)"""
    return dat["fname"].apply(str.removesuffix, args=(".txt",))


def rm_suffix(dat: pd.DataFrame) -> pd.Series:
    """Use pandas removesuffix from version 1.4"""
    return dat["fname"].str.removesuffix(".txt")


functions = [rm_map2, rm_apply_str_suffix, rm_map, rm_suffix, rm_re]

for base in range(12, 23):
    size = 2**base
    data = pd.DataFrame({"fname": ["fn"+str(i) for i in range(size)]})
    # Append ".txt" to a random half of the names.
    data.update(data.sample(frac=.5)["fname"]+".txt")
    for func in functions:
        diff = 0
        for _ in range(ITER):
            # Fresh copy each time, because rm_map modifies its input in place.
            data_copy = data.copy()
            start = time.process_time()
            func(data_copy)
            diff += time.process_time() - start
        print(diff, end="\t")
The output is plotted below:

It can be seen from the plot that the slowest solution is the regexp, and the fastest is pandas.Series.map with a conditional. This may change in later versions of pandas; I'd expect an improvement in pandas.Series.str.removesuffix, as it has greater potential for vectorization.
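The timings are only comparable if every variant returns the same result, so a quick sanity check may be worthwhile. The following is a sketch on made-up sample data (not part of the original benchmark), assuming pandas 1.4+ and Python 3.9+. The regexp is written with an escaped dot here so that it matches a literal period; with an unescaped dot, an entry such as "atxt" would also be matched.

```python
import pandas as pd

# Hypothetical sample: "atxt" and "fn3.txt.bak" must stay untouched.
s = pd.Series(["fn0.txt", "fn1", "fn2.txt", "atxt", "fn3.txt.bak"])

variants = {
    "regex": s.str.replace(r"\.txt$", "", regex=True),
    "map_conditional": s.map(lambda x: x[:-4] if x[-4:] == ".txt" else x),
    "apply_removesuffix": s.apply(str.removesuffix, args=(".txt",)),
    "str_removesuffix": s.str.removesuffix(".txt"),
}

expected = ["fn0", "fn1", "fn2", "atxt", "fn3.txt.bak"]

# Every variant should produce exactly the same list of names.
for name, result in variants.items():
    assert result.tolist() == expected, name
```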
As of 2021-11-30, pandas had to be installed from source, because version 1.4 was still in development. I installed it by following the instructions in the pandas dev repo: cloning the project and installing with python setup.py install.
My machine:
- AMD Ryzen 5 2400G with Radeon Vega Graphics, 3.60 GHz
- Windows 10 20H2
- Python 3.10.0, pandas.__version__ '1.4.0.dev0+1267.gaee662a7e3', numpy.__version__ '1.21.4'