Python's built-in string methods are fast and can be used in a list comprehension to fix column names:
# replace spaces with underscores
df.columns = [c.replace(' ', '_') for c in df]
# strip leading whitespace
df.columns = [c.lstrip() for c in df]
# strip trailing whitespace
df.columns = [c.rstrip() for c in df]
# replace leading whitespace with a single underscore
# (note: this prepends '_' even when a name has no leading whitespace)
df.columns = ['_' + c.lstrip() for c in df]
or map the strip methods over the column names:
# strip leading white spaces
df.columns = list(map(str.lstrip, df))
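The individual steps above can also be chained in one pass. Here is a minimal sketch on a plain list of names; `clean_column_name` and `leading_spaces_to_underscores` are hypothetical helpers made up for illustration, not pandas functions. The second one shows one way to turn each leading space into an underscore without touching names that have none.

```python
def clean_column_name(name):
    """Hypothetical helper: trim both ends, then join inner words with '_'."""
    return name.strip().replace(' ', '_')


def leading_spaces_to_underscores(name):
    """Hypothetical helper: swap each leading space for '_', keep the rest."""
    body = name.lstrip(' ')
    return '_' * (len(name) - len(body)) + body


raw = ['  first name', 'last name  ', ' zip code ']

cleaned = [clean_column_name(c) for c in raw]
print(cleaned)  # ['first_name', 'last_name', 'zip_code']

marked = [leading_spaces_to_underscores(c) for c in raw]
print(marked)  # ['__first name', 'last name  ', '_zip code ']

# With a DataFrame, the same comprehension can be assigned back:
# df.columns = [clean_column_name(c) for c in df]
```

Iterating over a DataFrame yields its column labels, so the commented line at the end should behave like the earlier one-liners, assuming all labels are strings.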
Since pandas' vectorized string methods (pandas.Index.str and pandas.Series.str) aren't optimized, using Python string methods in a comprehension is usually faster, especially if you need to chain them.
For example, for 100k column names, if you need to chain 3 methods together, Python string methods are 2-5 times faster than equivalent pandas methods.
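# run in IPython/Jupyter (%timeit is a magic command)
import pandas as pd
n = 100_000
# one row, 100k columns named like " 0 100000 ", " 1 99999 ", ...
df = pd.DataFrame([range(n)], columns=[f" {i} {j} " for i, j in zip(range(n), range(n, 0, -1))])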
%timeit df.set_axis(df.columns.str.replace('^ +', 'S', regex=True).str.replace(' +$', 'E', regex=True).str.replace(' ', '_'), axis=1)
# 331 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.set_axis('S' + df.columns.str.strip().str.replace(' ', '_') + 'E', axis=1)
# 118 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.set_axis(['S' + c.strip().replace(' ', '_') + 'E' for c in df], axis=1)
# 68 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
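To rerun this comparison outside IPython, something like the following sketch with the standard-library `timeit` module might work. The function names are made up for illustration, and `n` is reduced to 1,000 so it finishes quickly, so absolute timings will differ from the figures above. It also checks that the three variants produce identical names before timing them.

```python
import timeit

import pandas as pd

n = 1_000  # assumption: smaller than the 100k above, just to keep the run short
names = [f" {i} {j} " for i, j in zip(range(n), range(n, 0, -1))]
df = pd.DataFrame([range(n)], columns=names)


def with_regex(df):
    # three chained pandas .str.replace calls, two of them regex-based
    cols = df.columns.str.replace('^ +', 'S', regex=True)
    cols = cols.str.replace(' +$', 'E', regex=True)
    return list(cols.str.replace(' ', '_', regex=False))


def with_pandas_str(df):
    # pandas .str.strip / .str.replace plus string concatenation on the Index
    return list('S' + df.columns.str.strip().str.replace(' ', '_', regex=False) + 'E')


def with_comprehension(df):
    # plain Python string methods in a list comprehension
    return ['S' + c.strip().replace(' ', '_') + 'E' for c in df]


variants = [with_regex, with_pandas_str, with_comprehension]

# Sanity check: all three approaches should yield the same column names.
results = [f(df) for f in variants]
all_agree = all(r == results[0] for r in results)
print("variants agree:", all_agree)

repeats = 20
for f in variants:
    seconds = timeit.timeit(lambda: f(df), number=repeats)
    print(f"{f.__name__}: {seconds / repeats * 1e3:.3f} ms per call")
```

The ratio between the variants depends on the pandas version, the column dtype, and the hardware, so it is worth measuring on your own data rather than relying on a fixed factor.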