My Regex to remove RT is not working for some reason

Question

The head of my dataframe looks like this

import re

for i in df.index:
    txt = df.loc[i]["tweet"]
    txt = re.sub(r'@[A-Z0-9a-z_:]+', '', txt)  # remove username tags
    txt = re.sub(r'^[RT]+', '', txt)  # remove RT tags
    txt = re.sub('https?://[A-Za-z0-9./]+', '', txt)  # remove URLs
    txt = re.sub("[^a-zA-Z]", " ", txt)  # replace every non-letter (hashtags, punctuation) with a space
    df.at[i, "tweet"] = txt

However, running this does not remove the 'RT' tags. I would also like to remove the leading 'b' tag.

Raw result tweet column:

b Yal suppose you would people waiting for a tub of paint and garden furniture the league is gone and any that thinks anything else is a complete tool of a human who really needs to get down off that cloud lucky to have it back for
b RT watching porn aftern normal people is like no turn it off they don xe x x t love each other
b RT If not now when nn
b Used red wine as a chaser for Captain Morgan xe x x s Fun times
b RT shackattack Hold the front page s Lockdown property project sent me up the walls

Can you add a copy-and-pastable version of your data? https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Nick ODell, Oct 31 '20 at 15:18

score 2 · Accepted Answer · answered Oct 31 '20 at 15:49

Your regular expression is not working because the ^ sign means "at the beginning of the string", and the two characters you want to remove are not at the beginning: every string starts with b.
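A minimal sketch of the anchoring behaviour, using one row copied from the question's raw result. The variable names are only illustrative.

```python
import re

# One row from the question's raw result; it starts with "b ", not "RT".
row = "b RT If not now when nn"

# The anchored pattern finds nothing at position 0, so the string is unchanged.
anchored = re.sub(r'^[RT]+', '', row)
print(anchored == row)

# The same pattern does match when the string really starts with "RT".
starts_with_rt = re.sub(r'^[RT]+', '', "RT If not now when nn")
print(repr(starts_with_rt))
```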

If you change r'^[RT]+' to r'[RT]+', the two letters will be removed. But be careful: [RT]+ matches any run of the letters R and T, so every other uppercase R or T in the text will be removed, too.
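To illustrate the side effect, here is a small sketch on a made-up sample string (not taken from the question's data). The word-boundary pattern r'\bRT\b' is an alternative added here for comparison, not part of the original answer.

```python
import re

# Hypothetical sample text with capital R and T outside the "RT" tag.
sample = "b RT Thanks Rachel, Really good"

# Character class: strips every run of 'R' or 'T', wherever it occurs.
loose = re.sub(r'[RT]+', '', sample)
print(loose)

# Word boundaries: strips only the standalone token "RT".
strict = re.sub(r'\bRT\b', '', sample)
print(strict)
```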

If you want to remove the letter b as well, try r'^b\s([RT]+)?'.
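A rough sketch of that pattern applied to two rows from the question, plus one invented row ("b Today is fine") to show a possible pitfall: because [RT]+ is a character class, a tweet whose first word begins with T or R loses that letter. The stricter pattern is an assumption offered as one way around this, not something from the original answer.

```python
import re

suggested = re.compile(r'^b\s([RT]+)?')
# Assumed alternative: only drop "RT" when it is a whole word after the "b".
stricter = re.compile(r'^b\s+(?:RT\b\s*)?')

rows = [
    "b RT If not now when nn",
    "b Used red wine as a chaser for Captain Morgan xe x x s Fun times",
    "b Today is fine",  # hypothetical row, not from the question
]

with_suggested = [suggested.sub('', r).strip() for r in rows]
with_stricter = [stricter.sub('', r).strip() for r in rows]

for before, a, b in zip(rows, with_suggested, with_stricter):
    print(repr(before), '->', repr(a), '|', repr(b))
```

On a dataframe, the same compiled pattern could be applied to each value of the tweet column inside the existing loop in place of the r'^[RT]+' substitution.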

I suggest you try it yourself on https://regex101.com/
