I have tried looking into this and couldn't find any possible way to do this the way I imagine. The term as an example I am trying to group is 'No complaints', when looking at this word the 'No' is picked up during the stopwords which I have manually removed from the stopwords to ensure it is included in the data. However, both words will be picked during the sentiment analysis as Negative words. I am wanting to combine them together so they can be categorised under either Neutral or Positive. Is it possible to manually group them words or terms together and decide how they are analysed in the sentiment analysis?
I have found a way to group words using POS tagging & Chunking but this combines tags together or Multi-Word Expressionsand doesn't necessarily pick them up correctly in the sentiment analysis.
Current code (using POS Tagging):
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize, MWETokenizer
import re, gensim, nltk
from gensim.utils import simple_preprocess
import pandas as pd
d = {'text': ['no complaints', 'not bad']}
df = pd.DataFrame(data=d)
stop = stopwords.words('english')
stop.remove('no')
stop.remove('not')
def sent_to_words(sentences):
for sentence in sentences:
yield(gensim.utils.simple_preprocess(str(sentence), deacc=True)) # deacc=True removes punctuations
data_words = list(sent_to_words(df))
def remove_stopwords(texts):
return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
txt = df
txt = txt.apply(str)
#pos tag
words = [word_tokenize(i) for i in sent_tokenize(txt['text'])]
pos_tag= [nltk.pos_tag(i) for i in words]
#chunking
tagged_token = nltk.pos_tag(tokenized_text)
grammar = "NP : {<DT>+<NNS>}"
phrases = nltk.RegexpParser(grammar)
result = phrases.parse(tagged_token)
print(result)
sia = SentimentIntensityAnalyzer()
def find_sentiment(post):
if sia.polarity_scores(post)["compound"] > 0:
return "Positive"
elif sia.polarity_scores(post)["compound"] < 0:
return "Negative"
else:
return "Neutral"
df['sentiment'] = df['text'].apply(lambda x: find_sentiment(x))
df['compound'] = [sia.polarity_scores(x)['compound'] for x in df['text']]
df
Output:
(S
0/CD
(NP no/DT complaints/NNS)
1/CD
not/RB
bad/JJ
Name/NN
:/:
text/NN
,/,
dtype/NN
:/:
object/NN)
|text |sentiment |compound
|:--------------|:----------|:--------
0 |no complaints |Negative |-0.5994
1 |not bad |Positive | 0.4310
I understand that my current code does not incorporate the POS Tagging & chunking in the sentiment analysis, but you can see the combination of the word 'no complaints' however it's current sentiment and sentiment score is negative (-0.5994), the aim is to use POS tagging and assign the sentiment as positive... somehow if possible!