I have a dataset with a title column, and I want to extract words from it. I used count() to get the total number of occurrences of each word and then plotted them. Here is the code:

install.packages("remotes")
remotes::install_github("tweed1e/werfriends")


library(werfriends)

friends_raw <- werfriends::friends_episodes

library(tidytext)
library(tidyverse)

custom_stop_words <- bind_rows(tibble(word = c("1","2", "one"), 
                                      lexicon = c("custom", "custom", "custom")), 
                               stop_words)

friends_raw %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_remove(word, "'s")) %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word) %>%
  top_n(10, n) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) + geom_col() + coord_flip() + 
  scale_y_continuous(breaks = seq(0,30,5))

The friends_raw dataset also has a season column for each title, and I would like the plot to show, using fill, the season in which each occurrence happens. The problem is that with this approach I don't know how to keep the season column while still counting and getting the results ordered. Any clues on how to do this?

Norhther
  • Welcome to Stack Overflow! Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) – Tung Mar 24 '20 at 17:43
  • @Tung I edited the post – Norhther Mar 24 '20 at 18:53

1 Answer

Instead of count() you can use add_count() after group_by(season). This adds a per-season count for each word while keeping the other columns, including season.

After that, group_by(word, season) followed by slice(1) leaves one row per word and season, so you have the number of occurrences of each word in each season, with the season column still available for fill.

friends_raw %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_remove(word, "'s")) %>%
  anti_join(custom_stop_words, by = "word") %>%
  group_by(season) %>%
  add_count(word) %>%
  group_by(word, season) %>%
  slice(1) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  filter(word_total>5) %>%
  mutate(word = fct_reorder(word, word_total)) %>%
  ggplot(aes(x = word, y = n, fill = factor(season))) + geom_col() + coord_flip() + 
  scale_y_continuous(breaks = seq(0,30,5)) +
  scale_fill_discrete(name = "Season")

Plot: Friends plot by season with number of words

Ben
  • Could you explain the ` group_by(word, season) %>% slice(1) %>% group_by(word) %>%` part? – Norhther Mar 25 '20 at 18:22
  • The `add_count(word)` will give word counts for each season (since grouped), but will have duplicates of count/n for each episode. When we `group_by(word, season)` and then `slice(1)` we only take one of the counts for a given word/season combination, ignoring the duplicate counts for the other episodes. The purpose of the second `group_by(word)` is just to total word count for each word, across all seasons for the plot (to order the factor and filter). Let me know if this makes sense. – Ben Mar 25 '20 at 18:29
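
To make the explanation above concrete, here is a minimal sketch on toy data. The `tokens` tibble and its values are hypothetical stand-ins for the tokenized titles, not the real werfriends data. It shows the duplicated counts that `add_count()` leaves behind, how `slice(1)` removes them, and that, as far as I can tell, calling `count(season, word)` directly gives the same table.

```r
library(dplyr)

# Hypothetical toy data: one row per word occurrence in a title.
tokens <- tibble(
  season = c(1, 1, 1, 2, 2, 2),
  word   = c("monica", "monica", "ross", "monica", "ross", "ross")
)

# Approach from the answer: add_count() keeps every row, so the same
# per-season count is repeated once per occurrence of the word.
with_duplicates <- tokens %>%
  group_by(season) %>%
  add_count(word)

# Keep a single row per word/season combination.
via_add_count <- with_duplicates %>%
  group_by(word, season) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(season, word)

# Shortcut: count() collapses to one row per season/word directly.
via_count <- tokens %>%
  count(season, word) %>%
  arrange(season, word)

# Total per word across seasons, used to order the factor and to filter.
totals <- via_count %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup()

print(via_add_count)
print(via_count)
print(totals)
```

If the two tables match on your real data as well, `count(season, word)` could replace the `group_by(season)`, `add_count(word)`, `group_by(word, season)` and `slice(1)` steps, at the cost of dropping any other columns you might still need.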