In a pandas string column, eliminate the text preceding a substring

Question

I have a pandas DataFrame with a string column, and I would like to delete the **bold** text that precedes a substring:

Column1
**Yon-RM-**CT 500M
**Abib-RM-**CT 500M
**Wal-RM-**CT 500M
**Sopxc-RM-**CT 1000M

Note that the bold text can vary in length, but it always ends in "-RM-".

Please also provide an example of what you expect the result to look like, it's not clear from your description. What have you tried yourself, what problems did you run into? https://stackoverflow.com/help/how-to-ask — Grismar, Dec 21 '21 at 21:37
This is a pandas regex question. Please make sure to tag [tag:pandas]. Also, there are many duplicates, please search for them. — smci, Dec 21 '21 at 22:10
[**`df['Column1'].str.replace(pat, repl, ...)`**](https://stackoverflow.com/questions/28986489/how-to-replace-text-in-a-string-column-of-a-pandas-dataframe) , see that duplicate question. The rest is just finding the specific regex for your case. — smci, Dec 22 '21 at 07:36
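As a rough sketch of what that comment points toward: assuming the bold markers are only formatting (so the cell values contain no literal asterisks) and every value contains "-RM-", a pattern anchored on the separator might look like the following. The sample frame and the pattern are illustrative assumptions, not taken from the question.

```python
import pandas as pd

# Sample data reconstructed from the question, without bold markers (assumption).
df = pd.DataFrame({"Column1": ["Yon-RM-CT 500M", "Abib-RM-CT 500M",
                               "Wal-RM-CT 500M", "Sopxc-RM-CT 1000M"]})

# Lazily match everything from the start of the string through the first "-RM-"
# and replace it with an empty string.
result = df["Column1"].str.replace(r"^.*?-RM-", "", regex=True)
print(result.tolist())
```

Values that do not contain "-RM-" would be left unchanged by this pattern.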

score 0 · Answer 1 · answered Dec 21 '21 at 21:40

Assuming all you want is the part after the prefix (e.g. CT 500M) and every value follows the same format, apply a lambda that splits on "-" and takes the third element (index 2):

 df["Column1"] = df.apply(lambda x: x["Column1"].split("-")[2], axis=1)

You could also split by "RM"
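One way to split on the separator while staying vectorized could look like the sketch below. It assumes the values contain "-RM-" exactly as shown in the question and no literal asterisks; the sample frame and the name `tail` are hypothetical.

```python
import pandas as pd

# Sample data reconstructed from the question, without bold markers (assumption).
df = pd.DataFrame({"Column1": ["Yon-RM-CT 500M", "Abib-RM-CT 500M",
                               "Wal-RM-CT 500M", "Sopxc-RM-CT 1000M"]})

# Split at most once on the separator and keep the last piece,
# i.e. everything after the first "-RM-".
tail = df["Column1"].str.split("-RM-", n=1).str[-1]
print(tail.tolist())
```

Taking the last piece rather than a fixed index means a value without the separator comes back unchanged instead of raising or producing NaN.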

gotenks

Don't use `apply` when you have the `str.split` vectorial method – mozway Dec 21 '21 at 21:48
I don't understand, why not? – gotenks Dec 21 '21 at 21:59
This is much less efficient – mozway Dec 21 '21 at 22:00

score 0 · Answer 2 · answered Dec 21 '21 at 21:49

Use re.sub() from the re module to replace the part you don't want with '', and apply it to every value in the column. Something like this should work:

import re

cleaned = []
for value in df['Column1']:
    cleaned.append(re.sub(r'^\*.*\*', '', value))
df['Column1'] = cleaned

or

df['Column1'] = [re.sub(r'^\*.*\*', '', value) for value in df['Column1']]

^\*.*\* matches from an asterisk at the start of the string through the last asterisk in it, because .* is greedy. re.sub() replaces that match with whatever you choose, here an empty string.
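To make the greedy behaviour concrete, here is a small sketch using only the standard library. It assumes, as this answer does, that the asterisks are literally present in the strings; the sample values and the extra value with a later asterisk are made up for illustration.

```python
import re

# Sample values with literal asterisks, as this answer assumes (assumption).
values = ["**Yon-RM-**CT 500M", "**Sopxc-RM-**CT 1000M"]

pattern = re.compile(r"^\*.*\*")
cleaned = [pattern.sub("", v) for v in values]

# Greedy `.*` extends to the LAST asterisk in the string, so an asterisk
# later in the value causes more text to be removed than intended.
greedy_case = pattern.sub("", "**Yon-RM-**CT *500M")

print(cleaned)
print(greedy_case)
```

If later asterisks are possible in the data, a lazy pattern such as the one in the next answer may be the safer choice.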

score 0 · Answer 3 · answered Dec 21 '21 at 21:55

Assuming you want to remove everything between double asterisks (including the asterisks themselves), use Series.str.replace with a lazy regex (r'\*\*.*?\*\*'):

df['Column1'] = df['Column1'].str.replace(r'\*\*.*?\*\*', '', regex=True)

Output:

    Column1
0   CT 500M
1   CT 500M
2   CT 500M
3  CT 1000M
