Is there a way in python/pandas to remove a particular set of characters from a string

Question

Is there a way to remove a particular set of characters from a Python string in one go?

str='23.889,45 €'

I want to remove the dot '.' and the '€' sign, but I do not want to call replace() twice, as in str.replace('€','').replace('.',''), replacing each character with an empty string.

In SAS there is a function, compress, which takes a list of characters to remove; applying it strips every occurrence of those characters from a SAS string. For example, compress(str,'.€') will return str as 23889,45.

Is there a corresponding function in Python as well?

Use a regex: `df['col'] = df['col'].str.replace(r'€|\.', "", regex=True)` – Wiktor Stribiżew, Sep 26 '17 at 13:58
You do realise that the `compress` function you are talking about handles it in the very same way ?? — mrid, Sep 26 '17 at 13:58
@mrid Yes, it does. I just checked it once again to be sure of that. `compress(str,'.€')` would indeed remove all instances of dot and Euro sign and the string we shall finally obtain will be bereft of these aforementioned characters. — cph_sto, Sep 26 '17 at 14:02
@Wiktor Stribiżew Your code works perfectly. Thanks so much for that. But, I am still at pains to understand the syntax. Also, please put this code in the answer, so that others can benefit from it. Many thanks. — cph_sto, Sep 27 '17 at 13:33
@WiktorStribiżew Would there be a way where we could also specify the replacement? I mean in this case `€` be replaced with `$` and `,` with `.`? So in total from '23.889,45 €' we get '23889.45 $' — cph_sto, Sep 28 '17 at 07:15
I have added a replacement solution you may use with Pandas. — Wiktor Stribiżew, Sep 28 '17 at 07:33

Wiktor Stribiżew · Accepted Answer · 2017-09-28T07:32:45.560

Multiple char removal

You may use a regex to perform multiple character replacement.

The construct you are interested in can be a character class or a grouping with alternation.

Character classes are [...] with characters, character ranges or shorthand character classes inside them, and alternation groups are patterns like (...|...|...). Literal chars can cause problems in both constructs when they are regex metacharacters, but re.escape comes to the rescue: it makes sure the chars you pass to the regex are treated as literal chars.

See a Python 3 demo:

>>> import re
>>> charsToRemove = ["$", ".", "€"]
>>> s='23.889,45 €'
>>> print(re.sub("|".join([re.escape(x) for x in charsToRemove]), "", s)) # Alternation group
23889,45 
>>> print(re.sub(r"[{}]+".format("".join([re.escape(x) for x in charsToRemove])), "", s)) # Character class
23889,45
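To see why the escaping step matters, here is a minimal sketch using a made-up input string ("a.b$" is only an illustration, not taken from the question). Both characters in the list are regex metacharacters, so the unescaped pattern behaves very differently from the escaped one:

```python
import re

chars_to_remove = [".", "$"]  # both are regex metacharacters
s = "a.b$"                    # hypothetical sample input

# Unescaped: "." matches any character and "$" matches the end of the string,
# so every character ends up being removed.
unescaped_pattern = "|".join(chars_to_remove)
removed_unescaped = re.sub(unescaped_pattern, "", s)
print(repr(removed_unescaped))  # ''

# Escaped: only the literal dot and the literal dollar sign are removed.
escaped_pattern = "|".join(re.escape(c) for c in chars_to_remove)
removed_escaped = re.sub(escaped_pattern, "", s)
print(repr(removed_escaped))  # 'ab'
```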

In Pandas, you'd use

df['col'] = df['col'].str.replace(r"[{}]+".format("".join([re.escape(x) for x in charsToRemove])), "", regex=True)

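Note that the character class approach ([...]+) will usually work faster, since the regex engine checks each character against a single set instead of trying the alternatives one by one.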
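For plain Python strings, a regex is not strictly required. As a hedged alternative (not part of the original answer), the built-in str.translate with a table from str.maketrans deletes a set of characters in a single pass, which is probably the closest match to SAS compress:

```python
s = '23.889,45 €'

# With three arguments, str.maketrans treats the third one as the set of
# characters to delete; the first two (empty here) define 1:1 substitutions.
table = str.maketrans('', '', '.€')

cleaned = s.translate(table)
print(repr(cleaned))  # '23889,45 ' (the space before the removed sign remains)
```

In Pandas, the same table should work with df['col'].str.translate(table), assuming the column holds strings.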

Multiple replacements

You may consider creating a dictionary of replacements and then use it with Pandas replace:

>>> import re
>>> import pandas as pd
>>> # Order matters: remove the dots before turning ',' into '.'
>>> repl_list = {'€': '$', r'\.': '', ',': '.'}
>>> col_list = ['23.889,45 €']
>>> frame = pd.DataFrame(col_list, columns=['col'])
>>> frame['col'] = frame['col'].replace(repl_list, regex=True)
>>> frame['col']
0    23889.45 $

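To make it work, you must pass the regex=True argument, since all the keys in repl_list are then treated as regular expressions. Do not forget to escape special regex chars in there. See What special characters must be escaped in regular expressions? Or, you may write r'\.' as re.escape('.') (after import re). The replacements are applied one after another, so the order of the keys matters: the dot must be removed before the comma is turned into a dot.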
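If the dependence on key order is a concern, one possible workaround is to do all replacements in a single pass with re.sub and a callback, so that no replacement ever sees the output of another. This is a sketch under that assumption; the names replacements and replace_all are made up for illustration:

```python
import re

# Literal characters and what each should become ('' means "remove").
replacements = {'€': '$', ',': '.', '.': ''}

# One alternation over the escaped keys; each match is looked up in the dict.
pattern = re.compile("|".join(re.escape(k) for k in replacements))

def replace_all(text):
    """Apply every replacement in one left-to-right pass over text."""
    return pattern.sub(lambda m: replacements[m.group(0)], text)

converted = replace_all('23.889,45 €')
print(converted)  # 23889.45 $
```

With Pandas, something like frame['col'].map(replace_all) should apply the same function to each value of a string column.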

I am very thankful to you for your time and efforts. I was literally stuck with this problem. The links you provided were very useful and I suppose one can get a good understanding of regular expressions. I tested them here [regex](https://regex101.com/r/yS7lG7/1). I noticed that if you change the order in the repl_list for the 2nd and 3rd element, then the results could be different. So, it means that replacement is done sequentially. It solves my problem completely. Many many thanks Wiktor for your support. – cph_sto, Sep 28 '17 at 07:53
@OliverS Glad it worked. I think you can remove one of the duped comments :) — Wiktor Stribiżew, Sep 28 '17 at 08:16

score 0 · Answer 2 · answered Sep 26 '17 at 14:09

The compress function you are talking about must be doing something like this:

s = '23.889,45 €'

charsToRemove = ["$", ".", "€"]

def compress(s, charsToRemove):
    # Remove each listed character in turn.
    for ch in charsToRemove:
        s = s.replace(ch, '')
    return s

print(compress(s, charsToRemove))  # prints 23889,45 followed by a space

mrid


Yes, you are right. This is the way compress would be functioning. But the main idea behind posting this question was to find a function where I could avoid using multiple `replace` calls, either in a for loop or chained side by side. – cph_sto Sep 26 '17 at 14:13
The reason I was asking for a function specifically handling such a thing was that a dedicated Python function would be implemented in the lower-level language `C` and would do the same thing a lot faster than using multiple replace calls. – cph_sto Sep 26 '17 at 14:17
The problem was solved. I wish to thank you for your efforts. Many many thanks. Regards! – cph_sto Sep 28 '17 at 07:46
