Firstly, before you consider the negations/negatives, note that str.count might not be doing what you expect:
text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
text.count('apple') # Outputs: 4
But if you do:
text = "The thief grappled the pineapples and ran away with a basket of apples"
text.count('apple') # Outputs: 3, since "grappled", "pineapples" and "apples" all contain the substring "apple"
If you want to count the words, you would need to do some tokenization first to change the string into a list of strings, e.g.
from collections import Counter
import nltk
from nltk import word_tokenize
nltk.download('punkt')
text = "The thief grappled the pineapples and ran away with a basket of apples"
Counter(word_tokenize(text))['apple'] # Output: 0
Counter(word_tokenize(text))['apples'] # Output: 1
Then you would need to ask yourself whether plurals matter when you want to count the no. of times apple/apples occur. If you want apple and apples counted as the same word, you would have to do some stemming or lemmatization, see Stemmers vs Lemmatizers.
This tutorial might be helpful: https://www.kaggle.com/code/alvations/basic-nlp-with-nltk
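For example, a minimal sketch using NLTK's WordNetLemmatizer (just one choice of lemmatizer, assuming you want plurals collapsed to the singular before counting):
from collections import Counter
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')  # depending on your NLTK version, you may also need nltk.download('omw-1.4')
text = "The thief grappled the pineapples and ran away with a basket of apples"
wnl = WordNetLemmatizer()
# Lemmatize every token as a noun so that "apples" -> "apple"
lemmas = [wnl.lemmatize(tok, pos='n') for tok in word_tokenize(text)]
Counter(lemmas)['apple'] # Output: 1, since "pineapples" lemmatizes to "pineapple", not "apple"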
Assuming you have settled on tokenization and lemmatization, and on whatever else you need to define what a "word" is and how to count it, you then have to define what negation is and what you ultimately want to do with the counts.
Let's go with:
I want to break the text down into "chunks" or clauses that have positive or negative sentiment towards some object/nouns.
Then you would have to define what negative/positive means; in the simplest terms you might say:
any negation word that comes within the window of the focus noun we consider as "negative", and in any other case, positive.
And if we try to code up this simplest quantification of negation, you would first have to
- identify the focus word, let's take the word apple, and
- then define the window, let's say 5 words before and 5 words after.
In code:
import nltk
from nltk import word_tokenize, ngrams
text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
NEGATIVE_WORDS = ["don't", "do not", "not"]
# word_tokenize splits contractions, e.g. "don't" -> ["do", "n't"],
# so add the "n't" token to the list of negation markers too
NEGATIVE_WORDS += ["n't"]

def count_negation(tokens):
    return sum(1 for word in tokens if word in NEGATIVE_WORDS)

for window in ngrams(word_tokenize(text), 5):
    if "apple" in window or "apples" in window:
        print(count_negation(window), window)
[out]:
0 ('I', 'love', 'apples', ',', 'apple')
0 ('love', 'apples', ',', 'apple', 'are')
0 ('apples', ',', 'apple', 'are', 'my')
0 (',', 'apple', 'are', 'my', 'favorite')
0 ('apple', 'are', 'my', 'favorite', 'fruit')
1 ('do', "n't", 'really', 'like', 'apples')
1 ("n't", 'really', 'like', 'apples', 'if')
0 ('really', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'too')
1 ('I', 'do', 'not', 'like', 'apples')
1 ('do', 'not', 'like', 'apples', 'if')
1 ('not', 'like', 'apples', 'if', 'they')
0 ('like', 'apples', 'if', 'they', 'are')
0 ('apples', 'if', 'they', 'are', 'immature')
Q: But isn't that kind of over-counting, when "I do not like apples" gets counted 3 times even though the sentence/clause appears only once in the text?
A: Yes, it is over-counting, so it goes back to the question of what the ultimate goal of counting the negations is.
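If all you want is a rough per-mention count, one way to reduce the over-counting is to count negations once per sentence (or clause) that mentions the focus word, instead of once per sliding window; a rough sketch, assuming sentence boundaries are a good enough proxy for clauses:
from nltk import sent_tokenize, word_tokenize
NEGATIVE_WORDS = ["don't", "do not", "not", "n't"]
def count_negation(tokens):
    return sum(1 for word in tokens if word in NEGATIVE_WORDS)
text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
for sent in sent_tokenize(text):
    tokens = word_tokenize(sent)
    if "apple" in tokens or "apples" in tokens:
        print(count_negation(tokens), sent)
This gives 0 negations for the first sentence and 1 each for the two sentences with negated mentions of apples.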
If the ultimate goal is to have a sentiment classifier, then I think lexical approaches might not be as good as state-of-the-art language models, like:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
text = "I love apples, apple are my favorite fruit. I don't really like apples if they are too mature. I do not like apples if they are immature either."
prompt=f"""Do I like apples or not?
QUERY:{text}
OPTIONS:
- Yes, I like apples
- No, I hate apples
"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
tokenizer.decode(model.generate(input_ids)[0], skip_special_tokens=True)
[out]:
Yes, I like apples
Q: But what if I want to explain why the model assigns positive/negative sentiment towards apples? How can I do that without counting negations?
A: Good point. Explaining model outputs is an active research area, so there's definitely no clear answer yet, but take a look at https://aclanthology.org/2022.coling-1.406