
I have survey data with a comments column, and I want to do sentiment analysis on the responses. The problem is that the data contains many languages, and I can't figure out how to eliminate stopwords from multiple languages.

'nps' is my data source, and nps$customer.feedback is the comments column.

First I tokenize the data:

#TOKENISE
library(dplyr)
library(tidytext)

comments <- nps %>% 
  filter(!is.na(customer.feedback)) %>% 
  select(cat, customer.feedback) %>% 
  mutate(id = row_number())          # one id per comment

# tokenise into one row per word
nps_words <- comments %>% 
  unnest_tokens(word, customer.feedback)

Then I get rid of stopwords:

nps_words <- nps_words %>% anti_join(stop_words, by = "word")
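
A quick sanity check at this point (a sketch, assuming nps_words has one token per row in a word column, as above): the most frequent remaining tokens make leftover stopwords from other languages easy to spot.

# Leftover non-English stopwords (e.g. the German "die") tend to
# dominate the top of this frequency table.
nps_words %>% 
  count(word, sort = TRUE) %>% 
  head(20)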

Then I use stemming and get_sentiments("bing") to show word counts by sentiment:

#stemgraph
library(SnowballC)   # for wordStem()
library(ggplot2)

nps_words %>% 
  mutate(word = wordStem(word)) %>% 
  inner_join(get_sentiments("bing") %>% mutate(word = wordStem(word)),
             by = "word") %>% 
  count(cat, word, sentiment) %>% 
  group_by(cat, sentiment) %>% 
  top_n(7, n) %>% 
  ungroup() %>% 
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ cat, scales = "free") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Word counts by Sentiment by Category - Bing (Stemmed)",
       x = "Words", y = "Count")

However, "di" and "die" appear in the "negative" graph because German text is being analysed along with the English.

Can someone help?

My goal is to eliminate German stopwords using the above code.

A quick Google search finds [this](https://github.com/stopwords-iso/stopwords-de), which has a list of German stop words. Would it also be worth splitting your comments by detected language first, [like this](https://stackoverflow.com/questions/8078604/detect-text-language-in-r)? – Michael Bird Aug 21 '18 at 15:33
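
Picking up Michael Bird's second suggestion, here is a minimal sketch of tagging each comment with a detected language before tokenising. It assumes the cld2 package (one of the options discussed in the linked post); the lang column name is my own.

library(dplyr)
library(cld2)   # assumed package choice for language detection

# Tag each comment with a detected language code ("en", "de", ...)
nps <- nps %>% 
  mutate(lang = detect_language(customer.feedback))

# e.g. keep only English comments for the English-only Bing lexicon
nps_en <- nps %>% filter(lang == "en")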

1 Answer


To answer your question, you could do this to remove German stopwords, using the stopwords package:

# ... your code up to the stopword removal ...

stop_german <- data.frame(word = stopwords::stopwords("de"), stringsAsFactors = FALSE)

nps_words <- nps_words %>% 
  anti_join(stop_words, by = "word") %>% 
  anti_join(stop_german, by = "word")

# ... rest of your code ...
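
Since the question mentions many languages, not just German, a hedged extension of the same idea: stopwords::stopwords() accepts other ISO language codes too, so a single combined stopword table can be built. The language list below is an assumption about what the survey contains.

# Assumed languages present in the survey responses
langs <- c("en", "de", "fr", "es")

# One data frame holding the stopwords of every expected language
stop_multi <- data.frame(
  word = unlist(lapply(langs, stopwords::stopwords)),
  stringsAsFactors = FALSE
)

nps_words <- nps_words %>% anti_join(stop_multi, by = "word")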

BUT, realise that tidytext is primarily meant for English, not for other languages: word stemming and sentiment analysis on German text will give you incorrect results. The Bing lexicon covers only English words, so the inner_join you are doing already removes most of the German words, as they have no English match. Some do match, though, like "die", which in German is a definite article or relative pronoun ("who", "that one") and which the German stopword list removes. But if you remove that word, you also accidentally remove the English "die" (to cease living).
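
If losing the English "die" matters, one hedged workaround (building on the language-detection idea from the comments) is to apply the German stopword list only to rows whose comment was detected as German. This sketch assumes a lang column, e.g. from cld2::detect_language, carried through unnest_tokens.

# Keep a word unless it comes from a German comment AND is a German stopword;
# rows with an unknown language (NA) are kept as well.
nps_words <- nps_words %>% 
  filter(is.na(lang) | lang != "de" | !word %in% stopwords::stopwords("de"))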

This SO post gives some more info about German sentiment analysis.

– phiver