A more efficient way to format strings sequentially?

Question

I am applying several string operations one after another. The code looks bulky and may also be inefficient. Is there a better way to do this, perhaps in a single function?

import pandas as pd

df = pd.DataFrame(columns=['Apple sauce', 'Banana & peach', 'c(&)a'])

df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '')
df.columns = df.columns.str.replace('&', '')
df.columns = df.columns.str.replace('(', '')
df.columns = df.columns.str.replace(')', '')

Desired output: df.columns = ['applesauce', 'bananapeach', 'ca']

Remember that df.str.replace uses regular expressions by default, so you may just use a character group for the replacement. `df.columns.str.replace('[ &()]','') ` — Alex Huszagh, Sep 10 '19 at 00:14
@AlexanderHuszagh thank you so much! How would we incorporate the lower() command in your code? — LaLaTi, Sep 10 '19 at 00:17
You could simply do it in two steps: `df.columns = df.columns.str.lower()` and then `df.columns = df.columns.str.replace('[ &()]','')`. — Alex Huszagh, Sep 10 '19 at 00:21

Massifox · Answer 1 · 2019-09-10T02:22:02.733

Saswat Padhi's solution is very cool, but it's not very efficient. If efficiency is your concern, consider my regex-based solution, which is about 2 times faster in the measurements below. Here is my code:

import re
columns = df.columns
skipped = '[ &()]'
formatted_columns = [re.sub(skipped, '', col).lower() for col in columns]
df.columns = formatted_columns
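
Since the question asks about doing this in a function, here is a minimal sketch of the same regex approach wrapped in a reusable helper. The name `clean_names` is hypothetical, and precompiling the pattern is an addition of this sketch that was not part of the measurements in this answer.

```python
import re

# Characters to strip: space, ampersand, and both parentheses.
SKIPPED = re.compile(r'[ &()]')


def clean_names(names):
    """Return lower-cased copies of the names with the skipped characters removed."""
    return [SKIPPED.sub('', name).lower() for name in names]


names = ['Apple sauce', 'Banana & peach', 'c(&)a']
print(clean_names(names))  # ['applesauce', 'bananapeach', 'ca']
```

With a DataFrame, the helper would be applied as `df.columns = clean_names(df.columns)`.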

Here are the measurements:

1. regex

%%timeit
columns = df.columns
formatted_columns = [re.sub(skipped, '', col).lower() for col in columns]
df.columns = formatted_columns
# 231 µs ± 56.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2. str.lower() and str.replace()

%%timeit
df.columns = df.columns.str.replace('[ &()]', '', regex=True).str.lower()
# 483 µs ± 112 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.columns = df.columns.str.lower().str.replace('[ &()]', '', regex=True)
# 500 µs ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
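
To repeat the comparison outside IPython, something like the following sketch with the standard `timeit` module should work. The function names `with_re` and `with_str_accessor` are hypothetical. Unlike the `%%timeit` cells, the sketch keeps the original names in a separate variable so that every repetition processes the same uncleaned input. No timings are quoted here because they depend on the machine and the pandas version.

```python
import re
import timeit

import pandas as pd

df = pd.DataFrame(columns=['Apple sauce', 'Banana & peach', 'c(&)a'])
original = df.columns  # never reassigned, so each run sees the raw names
skipped = '[ &()]'


def with_re():
    # Plain-Python loop using re.sub, as in approach 1.
    return [re.sub(skipped, '', col).lower() for col in original]


def with_str_accessor():
    # pandas .str accessor, as in approach 2 (regex=True passed explicitly).
    return original.str.replace(skipped, '', regex=True).str.lower()


for fn in (with_re, with_str_accessor):
    seconds = timeit.timeit(fn, number=1000)
    print(f'{fn.__name__}: {seconds / 1000 * 1e6:.1f} µs per call')
```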

There is also a very good answer under the question "Replacing two characters" that compares the execution times of many approaches.

score 0 · Accepted Answer · answered Sep 10 '19 at 00:32

As discussed in the comments, you can use a regex to replace several characters at once. You can also chain the operations, since both lower and replace return a new object rather than modifying the original. Pass regex=True explicitly, because pandas 2.0 changed the default of str.replace to literal matching:

df.columns = df.columns.str.lower().str.replace('[ &()]', '', regex=True)

or

df.columns = df.columns.str.replace('[ &()]', '', regex=True).str.lower()
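
For completeness, here is a small end-to-end sketch of the chained version using the column names from the question. Building an empty DataFrame with `pd.DataFrame(columns=...)` is just an assumption to make the example self-contained, and `regex=True` is spelled out so the character class is treated as a pattern on any pandas version.

```python
import pandas as pd

# Empty frame whose column names are the strings from the question.
df = pd.DataFrame(columns=['Apple sauce', 'Banana & peach', 'c(&)a'])

# Lower-case first, then strip spaces, ampersands and parentheses in one pass.
df.columns = df.columns.str.lower().str.replace('[ &()]', '', regex=True)

print(list(df.columns))  # ['applesauce', 'bananapeach', 'ca']
```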
