
I am cleaning up my column names one string operation at a time. The code looks bulky and may also be inefficient. Is there a better way to do this, perhaps in a single function?

import pandas as pd
df = pd.DataFrame(columns=['Apple sauce','Banana & peach','c(&)a'])

df.columns = df.columns.str.lower()
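# regex=False: treat each pattern literally ('(' alone is not a valid regex)
df.columns = df.columns.str.replace(' ', '', regex=False)
df.columns = df.columns.str.replace('&', '', regex=False)
df.columns = df.columns.str.replace('(', '', regex=False)
df.columns = df.columns.str.replace(')', '', regex=False)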

Desired output: df.columns = ['applesauce','bananapeach','ca']
juanpa.arrivillaga
LaLaTi
  • Remember that df.str.replace uses regular expressions by default, so you may just use a character group for the replacement. `df.columns.str.replace('[ &()]','') ` – Alex Huszagh Sep 10 '19 at 00:14
  • @AlexanderHuszagh thank you so much! How would we incorporate the lower() command in your code? – LaLaTi Sep 10 '19 at 00:17
  • You could simply do it in two steps: `df.columns = df.columns.str.lower()` and then `df.columns = df.columns.str.replace('[ &()]','')`. – Alex Huszagh Sep 10 '19 at 00:21

2 Answers


Saswat Padhi's solution is neat, but it is not the fastest option. If efficiency is your concern, consider this solution, which calls re.sub on each column name directly; in my measurements below it is about 2 times faster. This is my code:

import re
columns = df.columns
skipped = '[ &()]'
formatted_columns = [re.sub(skipped, '', col).lower() for col in columns]
df.columns = formatted_columns
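
Since the question asks for a function, the same idea can be wrapped in one. This is only a minimal sketch: the name `clean_names` and the precompiled `_SKIPPED` pattern are illustrative and not part of the code above, and it works on any iterable of strings.

```python
import re

# Hypothetical module-level pattern: the characters to strip from each name.
_SKIPPED = re.compile(r'[ &()]')

def clean_names(names, pattern=_SKIPPED):
    """Return lowercased names with every match of `pattern` removed."""
    return [pattern.sub('', name).lower() for name in names]

cleaned = clean_names(['Apple sauce', 'Banana & peach', 'c(&)a'])
print(cleaned)  # ['applesauce', 'bananapeach', 'ca']
```

With a DataFrame, this would be applied as `df.columns = clean_names(df.columns)`.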

Here are the measurements:

1. regex

%%timeit
columns = df.columns
formatted_columns = [re.sub(skipped, '', col).lower() for col in columns]
df.columns = formatted_columns
# 231 µs ± 56.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

2. str.lower() and str.replace()

%%timeit
df.columns = df.columns.str.replace('[ &()]', '', regex=True).str.lower()
# 483 µs ± 112 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.columns = df.columns.str.lower().str.replace('[ &()]', '', regex=True)
# 500 µs ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For many more execution-time comparisons, see the excellent answer to the question "Replacing two characters".
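
Outside IPython, timings like these can be gathered with the standard `timeit` module. The sketch below is not the benchmark used above: it works on a plain list of strings rather than a pandas Index, and the three function names are made up for illustration. It compares a compiled regex, chained `str.replace` calls, and `str.translate`. Absolute numbers will depend on your machine and input size.

```python
import re
import timeit

names = ['Apple sauce', 'Banana & peach', 'c(&)a']  # column names from the question

pattern = re.compile('[ &()]')
table = str.maketrans('', '', ' &()')  # third argument: characters to delete

def with_regex(cols):
    return [pattern.sub('', c).lower() for c in cols]

def with_chained_replace(cols):
    out = []
    for c in cols:
        c = c.lower()
        for ch in ' &()':
            c = c.replace(ch, '')
        out.append(c)
    return out

def with_translate(cols):
    return [c.translate(table).lower() for c in cols]

candidates = [with_regex, with_chained_replace, with_translate]
for fn in candidates:
    seconds = timeit.timeit(lambda: fn(names), number=10_000)
    print(f'{fn.__name__}: {seconds:.4f} s for 10,000 calls')
```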

Massifox

As discussed in the comments, you can use a regex character class to replace several characters at once. You can also chain the calls, since both lower and replace return a new object rather than modifying the original in place:

df.columns = df.columns.str.lower().str.replace('[ &()]', '', regex=True)

or

df.columns = df.columns.str.replace('[ &()]', '', regex=True).str.lower()
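
Both orderings can be checked against the column names from the question. This is a minimal sketch, assuming pandas is installed; the empty DataFrame is constructed here only for illustration. `regex=True` is passed explicitly because pandas 2.0 changed the default of `str.replace` to `regex=False`, under which `'[ &()]'` would be treated as a literal substring.

```python
import pandas as pd

# Hypothetical frame built from the column names in the question.
df = pd.DataFrame(columns=['Apple sauce', 'Banana & peach', 'c(&)a'])

# The two chaining orders; each call returns a new Index.
lower_first = df.columns.str.lower().str.replace('[ &()]', '', regex=True)
replace_first = df.columns.str.replace('[ &()]', '', regex=True).str.lower()

df.columns = lower_first
result = list(df.columns)
print(result)  # ['applesauce', 'bananapeach', 'ca']
```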
Saswat Padhi