I have survey data with a comments column and I am looking to do sentiment analysis on the responses. The problem is that the comments are in several languages, and I can't figure out how to remove the stopwords for more than one language from the data set.
'nps' is my data source and nps$customer.feedback is the comments column.
First I tokenize the data
library(dplyr)       # pipes, filter/select/count
library(tidytext)    # unnest_tokens, stop_words, get_sentiments
library(SnowballC)   # wordStem
library(ggplot2)     # plotting

#TOKENISE
comments <- nps %>%
  filter(!is.na(customer.feedback)) %>%   # drop rows with no comment
  select(cat, customer.feedback) %>%
  group_by(row_number(), cat) %>%         # keeps a row id alongside the category
  ungroup()
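nps_words is the one-word-per-row version of this, produced with unnest_tokens() (roughly; the exact call isn't shown above):

nps_words <- comments %>%
  unnest_tokens(word, customer.feedback)  # one row per word, lower-cased by default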
Then I get rid of stopwords:
nps_words <- nps_words %>% anti_join(stop_words, by = c('word'))
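For context, stop_words here is tidytext's built-in English stopword table (the SMART, snowball and onix lexicons), so this anti_join only removes English words:

unique(stop_words$lexicon)   # the SMART, snowball and onix lexicons -- all English
"die" %in% stop_words$word   # presumably FALSE, which is why the German article survives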
Then I stem the words and join them to get_sentiments("bing") to show word counts by sentiment.
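Stemming both sides of the join means inflected forms in the comments still match the lexicon, e.g. with SnowballC's default Porter stemmer:

wordStem(c("disappointed", "disappointing", "disappoint"))
# all three collapse to the same stem, "disappoint"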
#stemgraph
nps_words %>%
  mutate(word = wordStem(word)) %>%              # stem the comment words
  inner_join(get_sentiments("bing") %>%
               mutate(word = wordStem(word)),    # stem the lexicon the same way
             by = c("word")) %>%
  count(cat, word, sentiment) %>%
  group_by(cat, sentiment) %>%
  top_n(7) %>%                                   # top 7 words per category/sentiment
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~cat, scales = "free") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Word counts by Sentiment by Category - Bing (Stemmed)",
       x = "Words", y = "Count")
However, "di" and "die" show up in the 'negative' facets, presumably because the German text goes through the same pipeline and the German article "die" matches the English negative word "die" in the Bing lexicon.
Can someone help?
My goal is to also eliminate the German stopwords within the code above.
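What I think I need is something like the following, using get_stopwords() from tidytext (which wraps the stopwords package) to build a combined English + German list, but I'm not sure this is the right way to wire it in (multi_stop is just an illustrative name):

# sketch: combine the English table with a German snowball list
multi_stop <- bind_rows(
  stop_words %>% select(word),
  get_stopwords(language = "de", source = "snowball") %>% select(word)
)
nps_words <- nps_words %>% anti_join(multi_stop, by = "word")

Is that roughly the right approach, and where should it slot into the pipeline above?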