I am trying to count how many times certain words occur in a text while controlling for negations. In the example below, I wrote some very basic code that counts how many times each word in "w" appears in "txt". However, it does not account for negations such as "don't" and/or "not".

w = ["hello", "apple"]

for word in w:
    txt = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
    print(txt.count(word))

The code should count only the non-negated occurrences of "apple", not all 4. So I would like to add a rule: if there is a negation within n words before or after a word in "w", don't count that occurrence; otherwise, count it.

N.B. Negations here are words like "don't" and "not".

Can anyone help me with this?

Thanks a lot for your help!

Rollo99
  • (1) Split your text into a list of sentences; (2) Throw away sentences that contain `'not'` or `'don't'`; (3) Count the number of occurrences in the remaining sentences. – Stef Mar 20 '23 at 16:10
  • thanks for your hint. That would be quite biased as the negation may be referred to words which are not contained in my list – Rollo99 Mar 20 '23 at 16:22
  • @Stef what if the sentence is "I love apples, but not peaches"? (2) above will throw it away. – Super-intelligent Shade Mar 20 '23 at 16:25
  • This really requires a [Sentiment Analysis model](https://www.datarobot.com/blog/introduction-to-sentiment-analysis-what-is-sentiment-analysis/). – Super-intelligent Shade Mar 20 '23 at 16:35
  • @Super-intelligentShade Yes, of course the sentence would be thrown away. That's literally what the OP asked. I don't know why you formulated your comment as a question and pinged me. Was that purely rhetorical or are you actually expecting an answer? If you were expecting an answer, sorry, I don't have one for you. – Stef Mar 20 '23 at 16:38
  • @Stef sorry, I wasn't trying to harp on you. Just pointing out that while your idea works on this particular example, it will fail in the general case. – Super-intelligent Shade Mar 20 '23 at 16:46

1 Answer

First, before you consider negation: str.count matches substrings, not whole words, so it might not be doing what you expect.

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

text.count('apple') # Outputs: 4

But if you do:

text = "The thief grappled the pineapples and ran away with a basket of apples"

text.count('apple') # Outputs: 3

If you want to count the words, you would need to do some tokenization first to change the string into a list of strings, e.g.

from collections import Counter

import nltk
from nltk import word_tokenize

nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

text = "The thief grappled the pineapples and ran away with a basket of apples"

Counter(word_tokenize(text))['apple'] # Output: 0
Counter(word_tokenize(text))['apples'] # Output: 1
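
If NLTK is not available, the same contrast between substring matching and token counting can be sketched with the standard library alone. The regular expression below is an assumption made for illustration (lowercase letter runs with an optional internal apostrophe); it is much cruder than word_tokenize and is not part of the original answer.

```python
import re
from collections import Counter

text = "The thief grappled the pineapples and ran away with a basket of apples"

# Crude tokenizer (an assumption, not NLTK's behaviour): runs of letters,
# optionally with one internal apostrophe, after lowercasing.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
counts = Counter(tokens)

substring_hits = text.count("apple")  # also matches inside "grappled" and "pineapples"
print(substring_hits)    # 3
print(counts["apple"])   # 0: no token is exactly "apple"
print(counts["apples"])  # 1: only the final word
```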

Then you need to ask yourself whether plurals matter when you count how many times apple/apples occur. If singular and plural should be counted together, you would have to do some stemming or lemmatization (see "Stemmers vs Lemmatizers").

This tutorial might be helpful: https://www.kaggle.com/code/alvations/basic-nlp-with-nltk
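
To make the plural question concrete, here is a minimal sketch that folds "apples" into "apple" before counting. The crude_normalize function is a hypothetical stand-in written for this example; it only strips a trailing "s" and is no substitute for a real stemmer or lemmatizer such as the ones NLTK provides.

```python
import re
from collections import Counter

def crude_normalize(token):
    """Hypothetical stand-in for a stemmer: strip a plain trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

text = "I love apples, apple are my favorite fruit."

# Same crude regex tokenizer assumption: lowercase letter runs with an
# optional internal apostrophe.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
counts = Counter(crude_normalize(t) for t in tokens)

print(counts["apple"])  # 2: "apples" and "apple" are now counted together
```

A suffix rule like this breaks on irregular plurals and on words that merely end in "s", which is the kind of case the stemming and lemmatization tools are meant to handle.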


Assuming you adopt lemmatization and tokenization and have settled on what counts as a "word" and how to count it, you still have to define what negation is and what you ultimately want to do with the counts.

Let's go with:

I want to break the text down into "chunks" or clauses that have positive or negative sentiment towards some object/noun.

Then you have to define what negative/positive means. In the simplest terms, you might say:

If any negation word falls within a window around the focus noun, we consider that occurrence "negative"; in any other case, positive.

To code up this simplest way of quantifying negation, you would first have to

  • identify the focus word, let's take the word apple, and
  • then the window, let's say 5 words before and 5 words after.

In code:

import nltk
from nltk import word_tokenize, ngrams

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."

# word_tokenize splits "don't" into "do" + "n't", so match on the negation
# tokens it actually produces rather than on the untokenized strings.
NEGATIVE_TOKENS = {"not", "n't"}

def count_negation(tokens):
    return sum(1 for word in tokens if word.lower() in NEGATIVE_TOKENS)

# Simplification: slide a 5-token window over the text instead of taking
# 5 tokens on each side of the focus word.
for window in ngrams(word_tokenize(text), 5):
    if "apple" in window or "apples" in window:
        print(count_negation(window), window)

[out]:

0 ('I', 'love', 'apples', ',', 'apple')
0 ('love', 'apples', ',', 'apple', 'are')
0 ('apples', ',', 'apple', 'are', 'my')
0 (',', 'apple', 'are', 'my', 'favorite')
0 ('apple', 'are', 'my', 'favorite', 'fruit')
1 ('do', "n't", 'really', 'like', 'apples')
1 ("n't", 'really', 'like', 'apples', 'if')
0 ('really', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'too')
1 ('I', 'do', 'not', 'like', 'apples')
1 ('do', 'not', 'like', 'apples', 'if')
1 ('not', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'immature')

Q: But isn't that kind of over-counting, when "I do not like apples" gets counted 3 times even though the sentence/clause appears only once in the text?

Yes, it is over-counting, so it goes back to the question of what is the ultimate goal of counting the negations?
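
If the goal is simply the count the question asks for, one way to avoid the over-counting is to anchor a single window on each occurrence of the focus word and check that window once. The sketch below is only an illustration under stated assumptions: the tokenize and count_with_negation names, the regex tokenizer, and the default n=3 are all made up for this example, and the negation list is limited to the two words named in the question.

```python
import re

def tokenize(text):
    # Assumed crude tokenizer: lowercase letter runs with an optional
    # internal apostrophe, so "don't" stays a single token.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def count_with_negation(text, focus_words, negations, n=3):
    """Return (positive, negated) counts, one decision per focus occurrence."""
    tokens = tokenize(text)
    positive = negated = 0
    for i, tok in enumerate(tokens):
        if tok not in focus_words:
            continue
        # n tokens before and n tokens after the focus word, clipped at the start
        window = tokens[max(0, i - n): i + n + 1]
        if any(w in negations for w in window):
            negated += 1
        else:
            positive += 1
    return positive, negated

text = ("I love apples, apple are my favorite fruit. I don't really like apples "
        "if they are too mature. I do not like apples if they are immature either.")

result = count_with_negation(text, {"apple", "apples"}, {"not", "don't"}, n=3)
print(result)  # (2, 2)
```

The window ignores sentence boundaries, so a large n lets a negation from a neighbouring sentence leak in: with n=10 the second "apple" would be marked as negated by the "don't" in the following sentence.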

If the ultimate goal is to have a sentiment classifier then I think lexical approaches might not be as good as state-of-the-art language models, like:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."


prompt=f"""Do I like apples or not?
QUERY:{text}
OPTIONS:
 - Yes, I like apples
 - No, I hate apples
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)

[out]:

Yes, I like apples

Q: But what if I want to explain why the model assumes positive/negative sentiments towards apple? How can I do it without counting negations?

A: Good point. Explaining model outputs is an active research area, so there is no clear answer yet, but take a look at https://aclanthology.org/2022.coling-1.406

alvas