I'm playing around with some text analysis and trying to display the top words for each book, ranked by their tf-idf value (the `tf_idf` column in the code below). I've largely been following along with the Tidy Text Mining book, but using the Harry Potter books.
Some of the top words (by tf-idf) are shared between books (e.g. Lupin or Griphook), and when plotting, each word is positioned according to its maximum tf-idf across all books rather than its value within each book. For example, griphook is a key word in both Sorcerer's Stone and Deathly Hallows. It has a value of .0007 in Deathly Hallows but only .0002 in Sorcerer's Stone, yet it is plotted as the top word for Sorcerer's Stone.
hp.plot <- hp.words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

## For correct ordering of books
hp.plot$book <- factor(hp.plot$book, levels = c('Sorcerer\'s Stone', 'Chamber of Secrets',
                                                'Prisoner of Azkhaban', 'Goblet of Fire',
                                                'Order of the Phoenix', 'Half-Blood Prince',
                                                'Deathly Hallows'))

hp.plot %>%
  group_by(book) %>%
  top_n(10) %>%
  ungroup() %>%
  ggplot(aes(x = word, y = tf_idf, fill = book, group = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, scales = "free") +
  coord_flip()
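Here is a minimal sketch of what I think is going on in the `mutate()` step, using made-up values (the `toy` data frame and its numbers are hypothetical, not taken from my data). As far as I can tell, a factor has a single set of levels shared by every facet, so a word's position is fixed by its first (highest) appearance after sorting, regardless of which book that value came from.

```r
# Hypothetical toy data: "griphook" appears in two books with different tf-idf
toy <- data.frame(
  book   = c("Sorcerer's Stone", "Sorcerer's Stone",
             "Deathly Hallows", "Deathly Hallows"),
  word   = c("griphook", "quirrell", "griphook", "hallows"),
  tf_idf = c(0.0002, 0.0005, 0.0007, 0.0006),
  stringsAsFactors = FALSE
)

# Same idea as arrange(desc(tf_idf)) followed by
# factor(word, levels = rev(unique(word))), written in base R
toy <- toy[order(-toy$tf_idf), ]
toy$word <- factor(toy$word, levels = rev(unique(toy$word)))

# One global ordering; the last level is drawn at the top after coord_flip()
levels(toy$word)
# "quirrell" "hallows" "griphook"
```

In this toy case griphook ends up as the top level because of its 0.0007 in Deathly Hallows, so it would also sit above quirrell in the Sorcerer's Stone facet, even though its value there (0.0002) is lower than quirrell's (0.0005).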
And here's an image of the dataframe for your reference.
I've tried sorting the data before plotting, but that doesn't seem to work. Any ideas?
Edit: CSV is here