I am trying to remove everything except letters, numbers and the characters ! ? . ; , @ ' from a text column in a pandas DataFrame. I have already read other questions on the topic, but I still cannot make mine work.

Here is an example of what I am doing:

import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4],
                  'text':['hey+ guys! wuzup',
                              'hello p3ople!What\'s up?',
                              'hey, how-  thing == do##n',
                              'my name is bond, james b0nd']}
                )

Then we have the following table:

id                         text
1              hey+ guys! wuzup
2       hello p3ople!What's up?
3     hey, how-  thing == do##n
4   my name is bond, james b0nd

Now, trying to remove everything but letters, numbers and ! ? . ; , @ '

First try:

df.loc[:,'text'] = df['text'].str.replace(r"^(?!(([a-zA-z]|[\!\?\.\;\,\@\'\"]|\d))+)$",' ',regex=True)

output

id                         text
1              hey+ guys! wuzup
2       hello p3ople!What's up?
3     hey, how-  thing == do##n
4   my name is bond, james b0nd

Second try:

df.loc[:,'text'] = df['text'].str.replace(r"(?i)\b(?:(([a-zA-Z\!\?\.\;\,\@\'\"\:\d])))",' ',regex=True)

output

id                         text
1                  ey+ uys uzup
2              ello 3ople hat p
3            ey ow- hing == o##
4          y ame s ond ames 0nd

Third try:

df.loc[:,'text'] = df['text'].str.replace(r'(?i)(?<!\w)(?:[a-zA-Z\!\?\.\;\,\@\'\"\:\d])',' ',regex=True)

output

id                         text
1                 ey+ uys! uzup
2           ello 3ople! hat' p?
3           ey, ow- hing == o##
4         y ame s ond, ames 0nd

Afterwards, I also tried re.sub() with the same regex patterns, but still did not get the expected result, which is:

id                         text
1               hey guys! wuzup
2       hello p3ople!What's up?
3          hey, how-  thing don
4   my name is bond, james b0nd

Can anyone help me with that?

Links that I have looked at on the topic:

Is there a way to remove everything except characters, numbers and '-' from a string

How do check if a text column in my dataframe, contains a list of possible patterns, allowing mistyping?

removing newlines from messy strings in pandas dataframe cells?

https://stackabuse.com/using-regex-for-text-manipulation-in-python/

Mariane Reis
  • A bread and butter example of "how to ask a question". Example dataframe, your attempts, expected output and links to answers which are close. +1 – Erfan Jan 07 '20 at 22:49
  • This is still an unclear question. If by letters and numbers you mean `[A-Za-z0-9]`, how do you define *ponctuations and some other characters*? – Wiktor Stribiżew Jan 07 '20 at 23:15
  • I edited it to specify what I meant by some other characters! And thanks @Erfan :) – Mariane Reis Jan 08 '20 at 20:51

1 Answer

Is this what you are looking for?

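# regex=True is required on pandas 2.0+, where str.replace treats the pattern as a literal by default
df.text.str.replace(r"(?i)[^0-9a-z!?.;,@' -]", '', regex=True)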
Out: 
0                hey guys! wuzup
1        hello p3ople!What's up?
2          hey, how-  thing  don
3    my name is bond, james b0nd
Name: text, dtype: object
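
For reference, here is a self-contained sketch of the same idea that can be run end to end. It rebuilds the DataFrame from the question and deletes every character that is not on the allow-list by using a negated character class. The names `DISALLOWED` and `cleaned` are only illustrative, and the class spells out `A-Za-z` instead of relying on `(?i)`; the space and the hyphen are kept as well, which is an assumption based on the expected output in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": [
        "hey+ guys! wuzup",
        "hello p3ople!What's up?",
        "hey, how-  thing == do##n",
        "my name is bond, james b0nd",
    ],
})

# Negated character class: matches any single character that is NOT
# a digit, an ASCII letter, one of ! ? . ; , @ ' , a space or a hyphen.
# The hyphen is placed last so it is read as a literal, not as a range.
DISALLOWED = r"[^0-9A-Za-z!?.;,@' -]"

# regex=True is passed explicitly so the pattern is not treated as a literal string.
cleaned = df["text"].str.replace(DISALLOWED, "", regex=True)

print(cleaned.tolist())
```

Note that removing `==` leaves the two surrounding spaces in place, so row 3 ends up with a double space before `don`. If that matters, a second pass that collapses runs of spaces could be added, but that is a separate choice from the filtering itself.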
Onyambu