Text mining frequency with ggplot

Question

I am working with a dataset called HappyDB for a class presentation, analyzing demographic differences in word frequency. I'm using tidytext for most of the analyses and following its online guide to create most of my visuals. However, I'm running into a problem with the code that creates the labeled word-frequency plot. My dataset is structured differently from theirs; I thought I was accounting for that, but evidently I was not. This is their sample code to generate the graph (comparing Jane Austen with the Brontë sisters and H.G. Wells):

library(tidyr)

frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                   mutate(tidy_hgwells, author = "H.G. Wells"), 
                   mutate(tidy_books, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion) %>% 
  gather(author, proportion, `Brontë Sisters`:`H.G. Wells`)

library(scales)

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Jane Austen", x = NULL)

And that code generates this plot:

I'm hoping to emulate this with demographics in my dataset, but keep getting errors. Here is my code, which uses a dataset that I have already tidied:

library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)
library(stringr) 

windowsFonts(Franklin=windowsFont("Franklin Gothic Demi"))

marriedmen <- tidy_hm[which(tidy_hm$marital =="married" &
                               tidy_hm$gender == "m"),]
marriedwomen <- tidy_hm[which(tidy_hm$marital =="married" &
                                tidy_hm$gender == "f"),]
singlemen <- tidy_hm[which(tidy_hm$marital =="single" &
                             tidy_hm$gender == "m"),]

frequency <- bind_rows(mutate(marriedmen, status = "Married men"),
                       mutate(marriedwomen, status = "Married women"), 
                       mutate(singlemen, status = "Single men")) %>% 
  count(status, word) %>%
  group_by(status) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(status, proportion) %>% 
  gather(status, proportion, `Married women`:`Single men`)

library(scales)

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = 'Married men', color = abs(`Married men` - proportion)) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~status, ncol = 2) +
  theme(legend.position="none") +
  labs(y = NULL, x = NULL)

But I keep getting this error:

Error in log(x, base) : non-numeric argument to mathematical function

I tried removing the scale lines, but that caused a lot of the data to be dropped, and the resulting plot looked nothing like it was supposed to: it had no line, labels, or colors. I'm pretty new to R and to coding in general, so any help is appreciated.
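
For reference, the `non-numeric argument to mathematical function` message is what base R's `log()` reports when it is handed a character string instead of numbers. The sketch below reproduces that outside ggplot2 by contrasting a single-quoted name (a string) with a back-ticked name (a column reference). The `freq_demo` data frame and its values are made up for illustration and are not taken from HappyDB.

```r
# Hypothetical three-row stand-in for the `frequency` data frame
freq_demo <- data.frame(
  word = c("family", "friend", "work"),
  `Married men` = c(0.012, 0.008, 0.005),
  proportion = c(0.010, 0.009, 0.002),
  check.names = FALSE  # keep the space in the column name
)

# Single quotes create a character string; the column is never looked up
quoted <- with(freq_demo, 'Married men')

# Back-ticks refer to the column that has a non-syntactic name
backticked <- with(freq_demo, `Married men`)

class(quoted)
class(backticked)

# Taking a log of the string fails; the log of the column works
log_of_string <- tryCatch(log(quoted, 10), error = function(e) conditionMessage(e))
log_of_column <- log(backticked, 10)

print(log_of_string)
print(log_of_column)
```

The same distinction applies inside `aes()`: a quoted value is treated as a constant string rather than as a column of the data.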

In your `ggplot` call you have `y = 'Married men'`, which is setting y to be that character string. It looks like you meant to use back-ticks rather than apostrophes, as you do with the `color` element (assuming `Married men` is a variable name in the `frequency` dataframe) — Andrew Gustar, Apr 16 '18 at 17:31
@AndrewGustar Yes thank you for that catch. But changing that now gives me this error: `(Error in combine_vars(data, params$plot_env, vars, drop = params$drop): At least one layer must contain all variables used for facetting)` — SRobProsc, Apr 17 '18 at 18:57
It is hard to tell without seeing a bit of detail of the structure of your dataframe, but I wonder if the `spread` then `gather` is having the effect of removing the `status` variable, which you are trying to use for faceting. — Andrew Gustar, Apr 17 '18 at 19:59
Welcome to Stack Overflow! I cannot quite figure out what has gone wrong here without seeing something of the structure of your dataframe. You could check out this question about making a reproducible example in R for help on how to make a small example dataset so that we can help you out: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Julia Silge, Apr 21 '18 at 21:06
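
One way to follow up on the comments above is to run the reshaping steps on a tiny made-up dataset and check whether `status` is still a column afterwards, since `facet_wrap(~status)` needs it. This is only a sketch under assumptions: `toy_hm` is a hypothetical stand-in for `tidy_hm` with just `word`, `marital`, and `gender` columns, and `case_when()` plus `ungroup()` are substitutions made here, not part of the original code.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for tidy_hm: one row per word occurrence
toy_hm <- data.frame(
  word    = c("family", "family", "dinner", "family", "friend",
              "friend", "dinner", "game"),
  marital = c("married", "married", "married", "married", "married",
              "single", "single", "single"),
  gender  = c("m", "m", "m", "f", "f", "m", "m", "m"),
  stringsAsFactors = FALSE
)

toy_frequency <- toy_hm %>%
  # Label the three groups in one step instead of three subsets
  mutate(status = case_when(
    marital == "married" & gender == "m" ~ "Married men",
    marital == "married" & gender == "f" ~ "Married women",
    marital == "single"  & gender == "m" ~ "Single men"
  )) %>%
  filter(!is.na(status)) %>%
  count(status, word) %>%
  group_by(status) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%  # added in this sketch so later steps ignore the grouping
  select(-n) %>%
  spread(status, proportion) %>%
  gather(status, proportion, `Married women`:`Single men`)

# Inspect the result: is `status` present, and is `Married men` numeric?
names(toy_frequency)
str(toy_frequency)

# A small, pasteable description of the data for a reproducible example
dput(head(toy_frequency))
```

Running `names(frequency)` and `str(frequency)` on the real data in the same way would show whether `status` and `Married men` exist with the expected types before the plot is built.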
