You can use re.sub or Series.str.replace with a regex that matches any word in your negation_words list followed by whitespace, replacing that whitespace with an underscore.
import re
import pandas as pd
negation_words = ["no", "not"]
escaped_words = "|".join(re.escape(word) for word in negation_words)
print(repr(escaped_words))
# 'no|not'
regex = fr"({escaped_words})\s+"
print(repr(regex))
# '(no|not)\\s+'
Regex explanation:
(no|not)\s+
( ) : Capturing group. Whatever is matched inside is available to the replace string as \1 (since this is the first capturing group)
no|not : Either no or not. With more words in the list, any one of them
\s+ : One or more whitespace
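Since re.sub is mentioned as an alternative, here is a minimal sketch of the same replacement on plain strings. The helper name join_negation is made up for illustration; flags=re.IGNORECASE is used to get the case-insensitive behaviour.

```python
import re

negation_words = ["no", "not"]
escaped_words = "|".join(re.escape(word) for word in negation_words)
regex = fr"({escaped_words})\s+"

def join_negation(text):
    # join_negation is a hypothetical helper name, not part of re or pandas.
    # \1 re-inserts the matched word with its original casing;
    # flags=re.IGNORECASE makes the match case-insensitive.
    return re.sub(regex, r"\1_", text, flags=re.IGNORECASE)

print(join_negation("this is not a tweet"))  # this is not_a tweet
print(join_negation("Not another tweet"))    # Not_another tweet
print(join_negation("Tweet not"))            # Tweet not
```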
Now call Series.str.replace with case=False to do a case-insensitive match:
df = pd.DataFrame({'tweets': ['this is a tweet', 'this is not a tweet', 'no', 'Another tweet', 'Not another tweet', 'Tweet not']})
df['clean'] = df['tweets'].str.replace(regex, r'\1_', case=False, regex=True)
which gives:
                tweets                clean
0      this is a tweet      this is a tweet
1  this is not a tweet  this is not_a tweet
2                   no                   no
3        Another tweet        Another tweet
4    Not another tweet    Not_another tweet
5            Tweet not            Tweet not
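One caveat: the pattern has no word boundary, so it also matches when a negation word is the tail of a longer word (for example the "no" in "piano" or the "not" in "knot"). If that matters for your data, prefixing the group with \b is one possible fix. The sketch below compares the two on a made-up sentence; the names loose and strict are only for illustration.

```python
import re

negation_words = ["no", "not"]
escaped_words = "|".join(re.escape(word) for word in negation_words)

loose = fr"({escaped_words})\s+"     # pattern from the answer
strict = fr"\b({escaped_words})\s+"  # assumption: require a word boundary first

text = "piano is not a knot today"   # made-up example sentence

loose_result = re.sub(loose, r"\1_", text, flags=re.IGNORECASE)
strict_result = re.sub(strict, r"\1_", text, flags=re.IGNORECASE)

print(loose_result)   # piano_is not_a knot_today
print(strict_result)  # piano is not_a knot today
```

The stricter pattern can be passed to Series.str.replace in the same way as the original one.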
Joining the next two words to a negation word is slightly more complicated:
regex = fr"({escaped_words})\s+(\w+)\s+"
print(repr(regex))
# '(no|not)\\s+(\\w+)\\s+'
Explanation:
(no|not)\s+(\w+)\s+
( ) : Capturing group. Whatever is matched inside is available to the replace string as \1 (since this is the first capturing group)
no|not : Either no or not. With more words in the list, any one of them
\s+ : One or more whitespace
( ) : Capturing group #2, available to the replace string as \2
\w+ : One or more word characters
\s+ : One or more whitespace
df = pd.DataFrame({'tweets': ['this is a tweet', 'this is not a tweet', 'no', 'Another tweet', 'Not another tweet', 'Tweet not']})
df['clean'] = df['tweets'].str.replace(regex, r'\1_\2_', case=False, regex=True)
which gives:
                tweets                clean
0      this is a tweet      this is a tweet
1  this is not a tweet  this is not_a_tweet
2                   no                   no
3        Another tweet        Another tweet
4    Not another tweet    Not_another_tweet
5            Tweet not            Tweet not
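Note that the two-word pattern only matches when whitespace follows the second word, so a tweet such as "this is not good" (one word after the negation, then end of string) is left untouched. If you want to join up to N following words regardless, one option is a replacement function. This is only a sketch: join_after_negation and n_words are names invented here, and the leading \b is an added assumption to avoid matching inside longer words.

```python
import re

negation_words = ["no", "not"]
escaped_words = "|".join(re.escape(word) for word in negation_words)

def join_after_negation(text, n_words=2):
    """Join a negation word to up to n_words following words with underscores."""
    # Group 1: the negation word. Group 2: one to n_words "whitespace + word" pairs.
    pattern = fr"\b({escaped_words})((?:\s+\w+){{1,{n_words}}})"

    def _join(match):
        # Replace each whitespace run inside group 2 with a single underscore.
        return match.group(1) + re.sub(r"\s+", "_", match.group(2))

    return re.sub(pattern, _join, text, flags=re.IGNORECASE)

print(join_after_negation("this is not a tweet"))  # this is not_a_tweet
print(join_after_negation("this is not good"))     # this is not_good
print(join_after_negation("Tweet not"))            # Tweet not
```

With the DataFrame above, this could be applied as df['tweets'].map(join_after_negation).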