How to delete substrings with specific characters in a pandas dataframe?

Question

I have a pandas dataframe that looks like this:

COL

hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?
...
 Im  fine, what A/P_49 A/P_0.0309 about you?

The expected result should be:

COL

hi how are you?
...
Im fine, what about you?

How can I efficiently remove, from one column and from the full pandas dataframe, all the strings that contain A/P_?

I tried with this regular expression:

A/P_(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

However, I do not know if there is a simpler or more robust way of removing all those substrings from my dataframe. How can I remove all the strings that have A/P_ at the beginning?

UPDATE

I tried:

df_sess['COL'] = df_sess['COL'].str.replace(r'A/P(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '')

And it works; however, I would like to know if there is a more robust way of doing this, possibly with a regular expression.

You want to remove `A/P_1` and keep the rest of the string yes? — ababuji, Jul 01 '18 at 14:09
No, I want to remove the full string... in other words I want to remove all the strings that have `A/P_`, and let there the clean ones @Abhishek — anon, Jul 01 '18 at 14:10
so you want to delete the entire row where any column has `A/P_`? — ababuji, Jul 01 '18 at 14:11
Possible duplicate of [How to drop rows from pandas data frame that contains a particular string in a particular column?](https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p) — ababuji, Jul 01 '18 at 14:16
post a testable dataframe (with a few columns and rows) and expected result — RomanPerekhrest, Jul 01 '18 at 14:16
No because I dont want to drop rows, I want to drop the tokens @abishek — anon, Jul 01 '18 at 14:20
@anon, `hi how are you?` --> there's no `are` word in your input value — RomanPerekhrest, Jul 01 '18 at 14:27

Ben.T · Accepted Answer · 2018-07-01T15:34:02.770

3

One way is to use \S* to match all non-whitespace characters after A/P_, and to add \s so that the whitespace following the removed token is deleted as well:

df_sess['col'] = df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True)

In your input, there seems to be a typo (`AP_wueiwo` instead of `A/P_wueiwo`, or at least I think so), so with this input:

df_sess = pd.DataFrame({'col':['hi A/P_90890 how A/P_True A/P_/93290 are A/P_wueiwo A/P_|iwoeu you A/P_?9028k ?',
                              'Im fine, what A/P_49 A/P_0.0309 about you?']})
print(df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True))
0            hi how are you ?
1    Im fine, what about you?
Name: col, dtype: object

you get the expected output (apart from the space left before the final `?`).
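Passing `regex=True` explicitly matters on recent pandas versions (2.0 and later, as far as I know), where `Series.str.replace` treats the pattern as a literal string by default. A minimal, self-contained sketch of the approach above, reusing the sample data from this answer; the name `cleaned` is only for illustration:

```python
import pandas as pd

df_sess = pd.DataFrame({'col': [
    'hi A/P_90890 how A/P_True A/P_/93290 are A/P_wueiwo A/P_|iwoeu you A/P_?9028k ?',
    'Im fine, what A/P_49 A/P_0.0309 about you?',
]})

# A/P_  literal prefix of the tokens to drop
# \S*   the rest of the token (any run of non-whitespace characters)
# \s    the single whitespace character that follows the token
cleaned = df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True)
print(cleaned.tolist())
```

Note that a token sitting at the very end of a string has no trailing whitespace, so this exact pattern would leave it in place; `\s?` instead of `\s` would be one way around that.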

edited Jul 01 '18 at 15:34

answered Jul 01 '18 at 15:00


I am still having a large sequence of spaces, any idea of how to delete them into a single space? – anon Jul 01 '18 at 15:23

@anon you can chain `.str.replace(r'\s+', ' ')` after the `replace(r'A/P_\S*\s', '')`: it matches every run of one or more whitespace characters (`\s+`) and replaces it with a single space. – Ben.T Jul 01 '18 at 15:33
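The chaining suggested in this comment could look roughly like the sketch below. The sample strings are shortened variants of the question's data, and the final `.str.strip()` is an extra step (not part of the comment) that trims leading and trailing spaces:

```python
import pandas as pd

s = pd.Series([
    'hi A/P_90890 how   A/P_True are you A/P_?9028k ?',
    ' Im  fine, what A/P_49 A/P_0.0309 about you?',
])

tidy = (
    s.str.replace(r'A/P_\S*\s', '', regex=True)  # drop each token and the whitespace after it
     .str.replace(r'\s+', ' ', regex=True)       # collapse runs of whitespace into one space
     .str.strip()                                # extra step: trim both ends
)
print(tidy.tolist())
```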

score 2 · Answer 2 · answered Jul 01 '18 at 14:52

How about:

(df['COL'].replace(r'A/?P_\S*', '', regex=True)
          .replace(r'\s+', ' ', regex=True))

Here `/?` makes the slash optional, so both `A/P_...` and `AP_...` tokens are matched.

Full example:

import pandas as pd

df = pd.DataFrame({
    'COL': 
    ["hi A/P_90890 how A/P_True A/P_/93290 AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
    "Im  fine, what A/P_49 A/P_0.0309 about you?"]
})

# First drop the tokens, then collapse the leftover runs of whitespace
df['COL'] = (df['COL'].replace(r'A/?P_\S*', '', regex=True)
                      .replace(r'\s+', ' ', regex=True))

Returns (oh, there is an extra space before ?):

                        COL
0              hi how you ?
1  Im fine, what about you?
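Since the question also asks about the full dataframe, note that `replace(..., regex=True)` exists on `DataFrame` as well as on `Series`. The following is a rough sketch under a few assumptions: the `OTHER` column is made up for illustration, every column is assumed to hold strings (the final `str.strip` step relies on that), and the lookahead step that removes the space before punctuation is an addition, not part of the answer above.

```python
import pandas as pd

# 'OTHER' is a hypothetical second column, only here to show the frame-wide effect
df = pd.DataFrame({
    'COL': ['hi A/P_90890 how A/P_True are AP_wueiwo you A/P_?9028k ?',
            'Im  fine, what A/P_49 A/P_0.0309 about you?'],
    'OTHER': ['A/P_1 keep this', 'nothing to remove'],
})

cleaned = (
    df.replace(r'A/?P_\S*', '', regex=True)        # drop the tokens in every column
      .replace(r'\s+(?=[?!.,])', '', regex=True)   # remove whitespace right before punctuation
      .replace(r'\s+', ' ', regex=True)            # collapse remaining runs of whitespace
      .apply(lambda col: col.str.strip())          # trim both ends (string columns only)
)
print(cleaned)
```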

score 2 · Answer 3 · answered Jul 01 '18 at 15:03

Because of a bug in the replace() function in pandas 0.23.0 (https://github.com/pandas-dev/pandas/issues/21159), trying to replace by regex pattern raises an error:

df.COL.str.replace(regex_pat, '', regex=True)
...
--->
TypeError: Type aliases cannot be used with isinstance().

I would suggest using the pandas.Series.apply function with a precompiled regex pattern (this needs `import re`):

In [1170]: df4 = pd.DataFrame({'COL': ['hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?', 'Im  fine, what A/P_49 A/P_0.0309 about you?']})

In [1171]: pat = re.compile(r'\s*A/?P_[^\s]*')

In [1172]: df4['COL']= df4.COL.apply(lambda x: pat.sub('', x))

In [1173]: df4
Out[1173]: 
                         COL
0           hi how are you ?
1  Im  fine, what about you?
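One caveat with `apply` and a bare `pat.sub`: a non-string cell (such as a missing value) makes `pat.sub` raise a `TypeError`. A small sketch of a guarded version follows; the helper name `strip_tokens` and the shortened sample data are hypothetical and only serve as illustration:

```python
import re
import pandas as pd

pat = re.compile(r'\s*A/?P_\S*')

def strip_tokens(value):
    # Hypothetical helper: only strings are cleaned, anything else
    # (for example a missing value) is returned unchanged.
    return pat.sub('', value) if isinstance(value, str) else value

df4 = pd.DataFrame({'COL': ['hi A/P_90890 how AP_wueiwo are you A/P_?9028k ?', None]})
df4['COL'] = df4['COL'].apply(strip_tokens)
result = df4['COL'].tolist()
print(result)
```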
