I'm playing around with some text analysis and trying to display the top words for each book, ranked by tf-idf (term frequency-inverse document frequency, a numeric value). I've largely been following along with Tidy Text Mining, but using Harry Potter.

Some of the top words (by tf-idf) are shared between books (e.g. Lupin or Griphook), and when plotting, the order uses the max tf-idf for that word across all books. For example, griphook is a key word in both Sorcerer's Stone and Deathly Hallows. It has a value of .0007 in Deathly Hallows but only .0002 in Sorcerer's Stone, yet it is ordered as the top value for Sorcerer's Stone.

ggplot output

hp.plot <- hp.words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

##For correct ordering of books
hp.plot$book <- factor(hp.plot$book, levels = c('Sorcerer\'s Stone', 'Chamber of Secrets',
                                                 'Prisoner of Azkhaban', 'Goblet of Fire',
                                                 'Order of the Phoenix', 'Half-Blood Prince',
                                                 'Deathly Hallows'))

hp.plot %>%
  group_by(book) %>% 
  top_n(10) %>% 
  ungroup %>%
  ggplot(aes(x=word, y=tf_idf, fill = book, group = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, scales = "free") +
  coord_flip()

And here's an image of the dataframe for your reference.

I've tried sorting beforehand but that doesn't seem to work. Any ideas?

Edit: CSV is here

GeorgeR90

3 Answers


The reorder() function will reorder a factor by a specified variable (see ?reorder).

Inserting mutate(word = reorder(word, tf_idf)) after ungroup() in your last block before plotting should reorder by tf_idf. I don't have a sample of your data, but using the janeaustenr package, this does the same:

library(tidytext)
library(janeaustenr)
library(dplyr)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words, by = "book")

book_words <- book_words %>%
  bind_tf_idf(word, book, n) 


library(ggplot2)
book_words %>% 
  group_by(book) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>% 
  mutate(word = reorder(word, tf_idf)) %>% 
  ggplot(aes(x = word, y = tf_idf, fill = book, group = book)) + 
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, scales = "free") +
  coord_flip()
jdb
  • Thanks for giving it a look. The big issue is that there are no shared words in the Jane Austen books. Your solution seems to place a word correctly in the first facet where it appears, but it winds up in the wrong place the next time the word is in the list. I've attached a CSV with data to try it with. – GeorgeR90 Jul 24 '17 at 19:35
  • Ah, I see the problem now. I'm not sure how to make different factor orders for each facet, but if you split the data frame by book you can make individual plots for each book using this answer. – jdb Jul 24 '17 at 21:39
  • Thanks @jdb, knowing the actual terminology to search for led me to a working answer! – GeorgeR90 Jul 25 '17 at 13:08
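
To make the behaviour discussed in these comments concrete, here is a minimal sketch in base R. Only griphook's .0002 and .0007 come from the question; the other values are made-up toy numbers. The point is that a factor has a single set of levels, so reorder() ranks a shared word once, by its mean tf_idf pooled over all books:

```r
# Toy data; the quirrell and xenophilius values are invented for illustration.
hp <- data.frame(
  book   = c("Sorcerer's Stone", "Sorcerer's Stone",
             "Deathly Hallows", "Deathly Hallows"),
  word   = c("griphook", "quirrell", "griphook", "xenophilius"),
  tf_idf = c(0.0002, 0.0003, 0.0007, 0.0004),
  stringsAsFactors = FALSE
)

# reorder() sorts the levels by mean(tf_idf) per word, ignoring the book.
hp$word <- reorder(factor(hp$word), hp$tf_idf)
levels(hp$word)
# "quirrell" "xenophilius" "griphook"
```

Because griphook's pooled mean (.00045) is the largest, it is the last level, so with coord_flip() it is drawn above quirrell in the Sorcerer's Stone facet even though its value in that book (.0002) is smaller than quirrell's (.0003).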

This turned out to be a duplicate of an existing question; I just wasn't familiar with the ggplot terminology. It's answered in the SO thread below.

ggplot: Order bars in faceted bar chart per facet
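
For readers who land here first, a minimal sketch of one common way to get a separate order in each facet: build a key that is unique per word and book, order that key's levels by tf_idf within each book, and strip the book suffix from the axis labels. The toy values, the key column, and the "___" separator are assumptions for illustration, not taken from the linked thread or the original data:

```r
# Toy data; only griphook's two values come from the question.
hp <- data.frame(
  book   = c("Sorcerer's Stone", "Sorcerer's Stone",
             "Deathly Hallows", "Deathly Hallows"),
  word   = c("griphook", "quirrell", "griphook", "xenophilius"),
  tf_idf = c(0.0002, 0.0003, 0.0007, 0.0004),
  stringsAsFactors = FALSE
)

# One key per (word, book) pair, so a shared word gets one level per book.
hp$key <- paste(hp$word, hp$book, sep = "___")

# Order the levels by tf_idf within each book.
hp$key <- factor(hp$key, levels = hp$key[order(hp$book, hp$tf_idf)])

# Plot only if ggplot2 is installed; the axis labels drop the "___book" suffix.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(hp, aes(x = key, y = tf_idf, fill = book)) +
    geom_col(show.legend = FALSE) +
    scale_x_discrete(labels = function(x) sub("___.*$", "", x)) +
    labs(x = NULL, y = "tf-idf") +
    facet_wrap(~book, scales = "free") +
    coord_flip()
}
```

With scales = "free", each facet shows only its own keys, so griphook sits below quirrell in Sorcerer's Stone and above xenophilius in Deathly Hallows. Recent versions of tidytext wrap the same idea in reorder_within() and scale_x_reordered(), which may be more convenient if that package is already loaded.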

GeorgeR90

If you want to change the order of the factor levels manually, you could try:

word = factor(word, levels = word[c(grep("griphook", word)[1], grep("quirrell", word)[1], ...)]);

If the factor levels should be ordered by tf_idf, you could use the following:

# Row indices of tf_idf from largest to smallest
level_ordered <- order(tf_idf, decreasing = TRUE)
# Factor levels must be unique, so a repeated word keeps the position
# of its first (largest tf_idf) occurrence
word <- factor(word, levels = unique(word[level_ordered]))
HerthaBSC