How to join two words if they are preceded by certain words in sentences in a DataFrame

Question

I have a list of tweets containing negation words such as "not", "never", and "seldom".

I want to convert "not nice" to "not_nice" (separated by an underscore).

How can I join all of the "not"s in the tweets, with the words that follow them?

I tried the following, but it doesn't change anything; the sentences remain the same:

def combine(negation_words, word_scan):
    if type(negation_words) != list:
        negation_words = [negation_words]  
    n_index = []
    
    for i in negation_words:
        index_replace = [(m.end(0)) for m in re.finditer(i,word_scan)]
        n_index += index_replace
    for rep in n_index:
        letters = [x for x in word_scan]
        letters[rep] = "_"
        word_scan = "".join(letters)
    return word_scan

negation_words = ["no", "not"]
word_scan = df
combine(negation_words, word_scan)

df['clean'] = df['tweets'].apply(lambda x: combine(str(x), word_scan))
df

Pranav Hosangadi · Accepted Answer · 2023-02-17T18:41:02.363


You can use re.sub or Series.str.replace with a regex to look for any of the words in your negation_words list followed by whitespace, and replace that whitespace with an underscore.

import re

negation_words = ["no", "not"]

escaped_words = "|".join(re.escape(word) for word in negation_words)
print(repr(escaped_words))
# 'no|not'

regex = fr"({escaped_words})\s+"
print(repr(regex))
# '(no|not)\\s+'

Regex explanation:

(no|not)\s+
(      )      : Capturing group. Whatever is matched inside is available to the replace string as \1 (since this is the first capturing group)
 no|not       : Either of (no, not). If there are more words, then any one of these words
        \s+   : One or more whitespace
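
To see what the \1 backreference does before moving to the dataframe, the pattern can be tried on a single string with re.sub. This is only a minimal sketch; the sample sentence is made up for illustration and is not from the question.

```python
import re

negation_words = ["no", "not"]
escaped_words = "|".join(re.escape(word) for word in negation_words)
regex = fr"({escaped_words})\s+"

# \1 re-inserts whatever the first capturing group matched, so the
# original casing of the negation word is kept in the output.
# The sample sentence is an illustrative assumption.
result = re.sub(regex, r"\1_", "This is Not nice", flags=re.IGNORECASE)
print(result)  # This is Not_nice
```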

Now, call Series.str.replace with case=False to do a case-insensitive match:

import pandas as pd

df = pd.DataFrame({'tweets': ['this is a tweet', 'this is not a tweet', 'no', 'Another tweet', 'Not another tweet', 'Tweet not']})

df['clean'] = df['tweets'].str.replace(regex, r'\1_', case=False, regex=True)

which gives:

                tweets                clean
0      this is a tweet      this is a tweet
1  this is not a tweet  this is not_a tweet
2                   no                   no
3        Another tweet        Another tweet
4    Not another tweet    Not_another tweet
5            Tweet not            Tweet not
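
One caveat worth checking against your own data: the pattern has no word boundary, so "no" or "not" can also match at the end of a longer word (for example the "no" in "piano"). A possible refinement, which is an assumption on top of the answer rather than part of it, is to prefix the group with \b. The sketch below compares both patterns on made-up strings using re.sub; it assumes every negation word starts with a word character, otherwise \b would not behave as intended.

```python
import re

negation_words = ["no", "not"]
escaped_words = "|".join(re.escape(word) for word in negation_words)

loose = fr"({escaped_words})\s+"     # pattern used above
strict = fr"\b({escaped_words})\s+"  # assumed refinement: require a word boundary

# Made-up sample strings, not taken from the question's data.
samples = ["piano tuner", "not nice", "a knot here"]

loose_result = [re.sub(loose, r"\1_", s, flags=re.IGNORECASE) for s in samples]
strict_result = [re.sub(strict, r"\1_", s, flags=re.IGNORECASE) for s in samples]

print(loose_result)   # ['piano_tuner', 'not_nice', 'a knot_here']
print(strict_result)  # ['piano tuner', 'not_nice', 'a knot here']
```

If your tweets never contain words ending in "no" or "not", the two patterns give the same result.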

Joining the two words that follow one of the negation_words is slightly more complicated:

regex = fr"({escaped_words})\s+(\w+)\s+"
print(repr(regex))
# '(no|not)\\s+(\\w+)\\s+'

Explanation:

(no|not)\s+(\w+)\s+
(      )            : Capturing group. Whatever is matched inside is available to the replace string as \1 (since this is the first capturing group)
 no|not             : Either of (no, not). If there are more words, then any one of these words
        \s+         : One or more whitespace
           (   )    : Capturing group #2
            \w+     : One or more word characters
                \s+ : One or more whitespace

df = pd.DataFrame({'tweets': ['this is a tweet', 'this is not a tweet', 'no', 'Another tweet', 'Not another tweet', 'Tweet not']})

df['clean'] = df['tweets'].str.replace(regex, r'\1_\2_', case=False, regex=True)

which gives:

                tweets                clean
0      this is a tweet      this is a tweet
1  this is not a tweet  this is not_a_tweet
2                   no                   no
3        Another tweet        Another tweet
4    Not another tweet    Not_another_tweet
5            Tweet not            Tweet not
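
If the number of following words needs to vary, one option is a replacement function instead of a replacement string. The helper below, join_after_negation, is a hypothetical name and only a sketch. It differs from the regex above in three ways: it adds a \b word boundary, it does not leave a trailing underscore, and it also joins words at the very end of the string.

```python
import re

def join_after_negation(text, negation_words, n_words=2):
    """Join each negation word to up to n_words following words with underscores.

    Hypothetical helper for illustration; not part of pandas or re.
    """
    escaped = "|".join(re.escape(w) for w in negation_words)
    # Group 1: the negation word. Group 2: one to n_words following words,
    # each preceded by whitespace.
    pattern = re.compile(
        fr"\b({escaped})((?:\s+\w+){{1,{n_words}}})", flags=re.IGNORECASE
    )

    def _join(match):
        following = match.group(2).split()
        return "_".join([match.group(1), *following])

    return pattern.sub(_join, text)

print(join_after_negation("this is not a tweet", ["no", "not"]))  # this is not_a_tweet
print(join_after_negation("Not nice", ["no", "not"]))             # Not_nice
```

On a dataframe this could be applied per row, for example with df['tweets'].apply(lambda t: join_after_negation(t, negation_words)).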

edited Feb 17 '23 at 18:41

answered Feb 17 '23 at 01:11

Pranav Hosangadi


@ZulfiA `re.escape` escapes any special characters in its input e.g. `(`, `[`, etc. so that the special meaning is ignored and they are considered as literal characters. This is not necessary for the particular `negation_words` you've shown in your question, but is useful to generalize the approach to cases where such special characters might exist. If your original dataframe contains all lowercase text, you can omit the `case` argument but it doesn't hurt to keep it – Pranav Hosangadi Feb 17 '23 at 03:59
so it's alright if I removed the escaped words, right? my tweets data has no special characters in them because I already cleaned them all:) and I got this warning: "FutureWarning: The default value of regex will change from True to False in a future version." what does it mean? – Zulfi A Feb 17 '23 at 04:37
It's not okay to remove `escaped_words`, because that line also joins all words with a pipe, which in regex means `word1 or word2 or word3 or ...`. you can change it to `'|'.join(negation_words)`, but honestly it won't make a difference to keep it if you have no special characters – Pranav Hosangadi Feb 17 '23 at 04:41
The warning is because the `str.replace` function has an argument `regex` which defaults to `True`. Since this will be changed to `False` in a future version, you can explicitly specify it to remove the warning @ZulfiA – Pranav Hosangadi Feb 17 '23 at 04:42
oh i see! what if i want to combine the negation words with the two words that follow them? example: instead of "not_nice" i want to make it like this "not_nice_really_". – Zulfi A Feb 17 '23 at 05:05
Then you'd just tack on the regex to match a word after the `\s+` in `regex`, see https://stackoverflow.com/a/20956093/843953 – Pranav Hosangadi Feb 17 '23 at 05:50
i think it's kind of different with what i'm looking for:( can you explain what "r'\1_'" means? thank you so much! – Zulfi A Feb 17 '23 at 06:21
r denotes a raw string literal. `\1` means the first capture group (which is surrounded by parentheses, in this case the negation word present in the text), and `_` is a literal underscore. I'll update my answer to match two words tomorrow – Pranav Hosangadi Feb 17 '23 at 06:24
@ZulfiA see my updated answer to capture two words and join them with underscores – Pranav Hosangadi Feb 17 '23 at 18:41
i can't thank you enough for your help. have a nice day! – Zulfi A Feb 19 '23 at 00:31
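
The point about `re.escape` in the comments above can be illustrated with a word list that does contain a regex metacharacter. The token "not?" below is a hypothetical example chosen only because "?" has a special meaning in regex; it does not come from the question.

```python
import re

# Hypothetical word list; "not?" contains the metacharacter "?".
words = ["not?", "no"]

unescaped = "|".join(words)                      # 'not?|no'
escaped = "|".join(re.escape(w) for w in words)  # 'not\\?|no'

text = "why not? no idea"

# Unescaped: "t?" means "an optional t", so the literal "not?" is never matched.
unescaped_result = re.sub(fr"({unescaped})\s+", r"\1_", text)
# Escaped: "\?" matches a literal question mark.
escaped_result = re.sub(fr"({escaped})\s+", r"\1_", text)

print(unescaped_result)  # why not? no_idea
print(escaped_result)    # why not?_no_idea
```

For plain alphabetic words such as "no" and "not", both forms produce the same pattern, which is why escaping makes no difference in the original example.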
