I am trying to remove everything except letters, numbers and the characters ! ? . ; , @ ' from a text column in a pandas DataFrame. I have already read some other questions on the topic, but I still cannot make mine work.
Here is an example of what I am doing:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'text': ['hey+ guys! wuzup',
                            'hello p3ople!What\'s up?',
                            'hey, how- thing == do##n',
                            'my name is bond, james b0nd']})
Then we have the following table:
id text
1 hey+ guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Now I try to remove everything except letters, numbers and ! ? . ; , @ '
First try:
df.loc[:,'text'] = df['text'].str.replace(r"^(?!(([a-zA-z]|[\!\?\.\;\,\@\'\"]|\d))+)$",' ',regex=True)
output
id text
1 hey+ guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing == do##n
4 my name is bond, james b0nd
Second try:
df.loc[:,'text'] = df['text'].str.replace(r"(?i)\b(?:(([a-zA-Z\!\?\.\;\,\@\'\"\:\d])))",' ',regex=True)
output
id text
1 ey+ uys uzup
2 ello 3ople hat p
3 ey ow- hing == o##
4 y ame s ond ames 0nd
Third try:
df.loc[:,'text'] = df['text'].str.replace(r'(?i)(?<!\w)(?:[a-zA-Z\!\?\.\;\,\@\'\"\:\d])',' ',regex=True)
output
id text
1 ey+ uys! uzup
2 ello 3ople! hat' p?
3 ey, ow- hing == o##
4 y ame s ond, ames 0nd
Afterwards, I also tried re.sub() with the same regex patterns, but still did not get the expected result, which is as follows:
id text
1 hey guys! wuzup
2 hello p3ople!What's up?
3 hey, how- thing don
4 my name is bond, james b0nd
Can anyone help me with that?
Links I have already looked at on the topic:
Is there a way to remove everything except characters, numbers and '-' from a string
removing newlines from messy strings in pandas dataframe cells?
https://stackabuse.com/using-regex-for-text-manipulation-in-python/
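One direction that seems to fit the stated goal is a negated character class, which matches every character that is not in the allowed set, instead of lookarounds or word boundaries. The following is only a minimal sketch using plain re on the example strings. The names DISALLOWED and clean_text are made up for illustration, and it assumes that whitespace should be kept (the expected result still has spaces) and that repeated spaces should be squeezed to one. The hyphen is not in the stated allowed set, so this sketch removes it, whereas the expected result keeps it in row 3; putting a hyphen at the end of the class would keep it.

```python
import re

# Assumption: whitespace is kept, since the expected result still contains spaces.
# The hyphen is not in the stated allowed set, so it is removed here.
DISALLOWED = re.compile(r"[^A-Za-z0-9!?.;,@'\s]")


def clean_text(value):
    """Drop every character outside the allowed set, then squeeze repeated spaces."""
    stripped = DISALLOWED.sub("", value)
    return re.sub(r" {2,}", " ", stripped)


texts = [
    "hey+ guys! wuzup",
    "hello p3ople!What's up?",
    "hey, how- thing == do##n",
    "my name is bond, james b0nd",
]

cleaned = [clean_text(t) for t in texts]
for line in cleaned:
    print(line)
```

The same pattern string should also be usable with df['text'].str.replace(pattern, '', regex=True), in the same way as the attempts in the question, although the sketch above only exercises re.sub.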