Count saving another column

Question

I have a dataset that contains a title column, and I want to extract some words from it. I used the count() function to get the total number of occurrences of each word, and then plotted them. Here is the code:

install.packages("remotes")
remotes::install_github("tweed1e/werfriends")


library(werfriends)

friends_raw <- werfriends::friends_episodes

library(tidytext)
library(tidyverse)

custom_stop_words <- bind_rows(tibble(word = c("1","2", "one"), 
                                      lexicon = c("custom", "custom", "custom")), 
                               stop_words)

friends_raw %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_remove(word, "'s")) %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word) %>%
  top_n(10) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) + geom_col() + coord_flip() + 
  scale_y_continuous(breaks = seq(0,30,5))

The friends_raw dataset also has a season column for each title, and I would like to show the season in which the occurrences happen, using fill. The problem is that with this approach I don't know how to keep the season column while doing the count and still get the results ordered. Any clues on how to do this?

Welcome to Stack Overflow! Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) — Tung, Mar 24 '20 at 17:43

Ben · Accepted Answer · 2020-03-25T02:58:23.267

1

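Instead of using count, you can use add_count after group_by(season). Unlike count, add_count keeps the existing rows and columns and adds a column n, which here holds the count of each word within each season.

After that, if you group_by(word, season) and keep one row per group, you will have the data needed to show the number of occurrences of each word per season, with the season column still available for fill.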

friends_raw %>%
  unnest_tokens(word, title) %>%
  mutate(word = str_remove(word, "'s")) %>%
  anti_join(custom_stop_words, by = "word") %>%
  group_by(season) %>%
  add_count(word) %>%
  group_by(word, season) %>%
  slice(1) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  filter(word_total>5) %>%
  mutate(word = fct_reorder(word, word_total)) %>%
  ggplot(aes(x = word, y = n, fill = factor(season))) + geom_col() + coord_flip() + 
  scale_y_continuous(breaks = seq(0,30,5)) +
  scale_fill_discrete(name = "Season")
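
As a side note, the counting steps above can probably be written more compactly by passing both columns to count. This is only a sketch on toy data, not part of the original answer: the tokens tibble and its values are made up for illustration and stand in for the tokenized friends_raw data.

```r
library(dplyr)

# Hypothetical stand-in for the tokenized titles (values invented for illustration)
tokens <- tibble(
  season = c(1, 1, 1, 2, 2, 2),
  word   = c("monica", "monica", "ross", "monica", "ross", "ross")
)

word_counts <- tokens %>%
  count(word, season) %>%            # one row per word/season pair, count in n
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%    # total across seasons, for ordering and filtering
  ungroup()

word_counts
```

The result has the same columns the plot needs (word, season, n, word_total), so the filter, fct_reorder, and ggplot steps should carry over unchanged.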

[Plot: horizontal stacked bars of word counts, filled by season]

edited Mar 25 '20 at 02:58

answered Mar 25 '20 at 01:42


Could you explain the `group_by(word, season) %>% slice(1) %>% group_by(word) %>%` part? – Norhther Mar 25 '20 at 18:22
The `add_count(word)` will give word counts for each season (since grouped), but will have duplicates of count/n for each episode. When we `group_by(word, season)` and then `slice(1)` we only take one of the counts for a given word/season combination, ignoring the duplicate counts for the other episodes. The purpose of the second `group_by(word)` is just to total word count for each word, across all seasons for the plot (to order the factor and filter). Let me know if this makes sense. – Ben Mar 25 '20 at 18:29
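
To make the duplicate counts described in this comment concrete, here is a small sketch on toy data. The tokens tibble is hypothetical and only assumes the same season and word columns as the tokenized data in the answer.

```r
library(dplyr)

# Hypothetical data: "ross" appears in two titles in season 1 and one in season 2
tokens <- tibble(
  season = c(1, 1, 2),
  word   = c("ross", "ross", "ross")
)

with_counts <- tokens %>%
  group_by(season) %>%
  add_count(word)
# add_count keeps every row, so the season 1 count (n = 2) appears on both season 1 rows

deduped <- with_counts %>%
  group_by(word, season) %>%
  slice(1) %>%                       # keep a single row per word/season pair
  ungroup()

sum(with_counts$n)  # repeated rows inflate the total
sum(deduped$n)      # one count per word/season pair
```

Without the slice(1) step, geom_col would stack the repeated n values and overstate each bar.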
