
The head of my dataframe looks like this

[Image: head of the dataframe]

import re

for i in df.index:
    txt = df.loc[i]["tweet"]
    txt = re.sub(r'@[A-Z0-9a-z_:]+', '', txt)  # remove username tags
    txt = re.sub(r'^[RT]+', '', txt)  # intended to remove RT tags
    txt = re.sub(r'https?://[A-Za-z0-9./]+', '', txt)  # remove URLs
    txt = re.sub(r"[^a-zA-Z]", " ", txt)  # replace every non-letter character (including '#') with a space
    df.at[i, "tweet"] = txt

However, running this does not remove the 'RT' tags. In addition, I would like to remove the 'b' tag also.

Raw result tweet column:

b Yal suppose you would people waiting for a tub of paint and garden furniture the league is gone and any that thinks anything else is a complete tool of a human who really needs to get down off that cloud lucky to have it back for
b RT watching porn aftern normal people is like no turn it off they don xe x x t love each other
b RT If not now when nn
b Used red wine as a chaser for Captain Morgan xe x x s Fun times
b RT shackattack Hold the front page s Lockdown property project sent me up the walls
Olvin Roght
  • Can you add a copy-and-pastable version of your data? https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Nick ODell Oct 31 '20 at 15:18

1 Answer


Your regular expression is not working because the ^ sign anchors the match to the beginning of the string, and the two characters you want to remove are not at the beginning.

If you change r'^[RT]+' to r'[RT]+', the two letters will be removed. But be careful, because every other uppercase R or T anywhere in the text will be removed, too.
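
A minimal sketch of the difference between the anchored and the unanchored pattern. The sample string is made up for illustration (it is not from the question's data) and assumes the text already starts with "b " as in the raw result shown in the question:

```python
import re

# Hypothetical sample in the same shape as the question's output.
sample = "b RT The Red Team"

# Anchored: ^ requires R/T at position 0, but the string starts with "b",
# so nothing matches and the text comes back unchanged.
anchored = re.sub(r'^[RT]+', '', sample)

# Unanchored: every run of uppercase R or T is removed, not just the RT tag.
unanchored = re.sub(r'[RT]+', '', sample)

print(repr(anchored))
print(repr(unanchored))
```

The second result loses the leading letters of "The", "Red" and "Team" as well, which is the side effect to watch for.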

If you want to remove the letter b as well, try r'^b\s([RT]+)?'.
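
As a rough sketch, here is that pattern applied to strings shaped like the raw result in the question. It assumes the substitution runs after the non-letter characters have already been replaced by spaces, so that each string starts with "b ". The STRICT pattern is not part of the suggestion above; it is a hypothetical, slightly stricter variant that only strips the whole word RT, so a tweet that happens to begin with a capital T or R keeps its first letter:

```python
import re

# Pattern suggested above: leading "b", one whitespace character,
# then an optional run of R/T characters.
PREFIX = re.compile(r'^b\s([RT]+)?')

# Hypothetical stricter variant (an assumption, not from the answer):
# only the complete word "RT" is removed after the leading "b".
STRICT = re.compile(r'^b\s+(?:RT\b\s*)?')

rows = [
    "b RT If not now when nn",
    "b Used red wine as a chaser for Captain Morgan xe x x s Fun times",
    "b The league is gone",  # made-up row starting with a capital T
]

cleaned = [PREFIX.sub('', row).strip() for row in rows]
cleaned_strict = [STRICT.sub('', row).strip() for row in rows]

for before, after, strict in zip(rows, cleaned, cleaned_strict):
    print(repr(before), '->', repr(after), '|', repr(strict))
```

Both patterns agree on the first two rows; they differ only on the third, where the original pattern also consumes the "T" of "The".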

I suggest you try it yourself on https://regex101.com/

mosc9575