Remove optional characters from a column in pandas

Question

I have a column which may contain values like abc,def or abc,def,efg, or ab,12,34, etc. As you can see, some values end with a , and some don't. What I want to do is remove all such values that end with a comma ,.

Assume the data is loaded into a data frame. This is what I tried:

df[c] = df[c].astype('unicode').str.replace("/,*$/", '').str.strip()

But it doesn't do anything.

What am I doing wrong?

Can you post a sample input with expected output to the question? — Ch3steR, Jun 03 '20 at 05:37
Your title says *"remove optional characters"* while your description says *"What I want to do is remove all such values that end with a comma"*. Please clarify. — JvdV, Jun 03 '20 at 05:52
@JvdV the commas are optional in the end and hence the title. — Souvik Ray, Jun 03 '20 at 06:17
This regex is not correct, use `.str.replace(",+$", '').str.strip()` — Ryszard Czech, Jun 03 '20 at 10:09

Mayank Porwal · Accepted Answer · 2020-06-03T06:27:56.030

The way you were trying to do it would be something like this:

df[c] = df[c].str.rstrip(',')

rstrip(',') removes commas only from the end of the string.

strip(',') removes them from both the start and the end.
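To make the difference concrete, here is a minimal sketch on a made-up Series (the sample values are assumptions chosen for illustration, not data from the question):

```python
import pandas as pd

# Hypothetical sample values, including one with a leading comma
# and one with two trailing commas.
s = pd.Series(["abc,def", "abc,def,efg,", ",ab,12,34,", "x,,"])

right_only = s.str.rstrip(",")  # strips trailing commas only
both_ends = s.str.strip(",")    # strips leading and trailing commas

print(right_only.tolist())
print(both_ends.tolist())
```

Commas in the middle of a value are left alone in both cases, and repeated trailing commas are all removed.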

The above only modifies the text. It will not drop rows from the dataframe. If dropping the rows that end with a comma is what you want, do the following instead:

Use str.endswith:

df[~df['col'].str.endswith(',')]

Consider below df:

In [1547]: df
Out[1547]: 
         date id  value  rolling_mean   col
0  2016-08-28  A      1           nan    a,
1  2016-08-28  B      1           nan    b
2  2016-08-29  C      2           nan    c,
3  2016-09-02  B      0          0.50    d
4  2016-09-03  A      3          2.00    ee,ff
5  2016-09-06  C      1          1.50    gg,
6  2017-01-15  B      2          1.00    i,
7  2017-01-18  C      3          2.00    j
8  2017-01-18  A      2          2.50    k,

In [1548]: df = df[~df['col'].str.endswith(',')]    
In [1549]: df                               
Out[1549]: 
         date id  value  rolling_mean    col
1  2016-08-28  B      1           nan      b
3  2016-09-02  B      0          0.50      d
4  2016-09-03  A      3          2.00  ee,ff
7  2017-01-18  C      3          2.00      j
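One caveat worth sketching: if the column can contain missing values, the mask from str.endswith may contain NaN rather than only True/False (depending on the pandas version and the column dtype), and negating it with ~ can then fail. The example below assumes a hypothetical frame with one missing value and passes na=False so that missing values are treated as "does not end with a comma":

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in 'col'.
df = pd.DataFrame({"col": ["a,", "b", np.nan, "ee,ff", "gg,"]})

# na=False fills missing entries in the mask with False,
# so the mask is purely boolean and safe to negate.
mask = df["col"].str.endswith(",", na=False)
kept = df[~mask]

print(kept)
```

Here the row with the missing value is kept; pass na=True instead if such rows should be dropped along with the comma-terminated ones.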

actually your `rstrip(',')` method worked for me. So I will accept this answer. — Souvik Ray, Jun 03 '20 at 07:44

score 1 · Answer 2 · answered Jun 03 '20 at 10:10

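Your regex is wrong because it contains the / regex delimiter characters. Python regexes are written as plain strings, not as /.../ regex literals, so the slashes are treated as literal characters that must appear in the data.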

Use

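df[c] = df[c].astype('unicode').str.replace(",+$", '', regex=True).str.strip()

The ,+$ will match one or more commas at the end of the string. Passing regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.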
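A small side-by-side sketch of the two patterns, using made-up sample values (an assumption for illustration), shows why the original call appeared to do nothing:

```python
import pandas as pd

# Hypothetical sample values; the last one has two trailing commas.
s = pd.Series(["abc,def", "abc,def,efg,", "ab,12,34,,"])

# JavaScript-style pattern: the slashes are literal characters here,
# and a "/" can never follow the end-of-string anchor, so nothing matches.
js_style = s.str.replace("/,*$/", "", regex=True)

# Python-style pattern: one or more commas anchored at the end.
py_style = s.str.replace(",+$", "", regex=True)

print(js_style.tolist())
print(py_style.tolist())
```

The first result is identical to the input, while the second has the trailing commas removed.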

See proof.

Also, see Regular expression works on regex101.com, but not on prod

answered Jun 03 '20 at 10:10 · Ryszard Czech

I see. I have been using javascript recently and hence the regex. +1 for your answer. – Souvik Ray Jun 03 '20 at 11:56
