
I am currently using the wordcloud library to make a word cloud of the frequent terms in some text data. Each piece of text also comes with an associated year. I want to generate a separate word cloud for each year and animate the sequence automatically with a library like gganimate, so that I can visualize the most frequent keywords over time. Is there any way to do this? Any tips?

Conor Neilson
babycoder
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making an R question that folks can help with. That includes a sample of data, all necessary code, and a clear explanation of what you're trying to do and what hasn't worked. Right now this is too broad for SO – camille Apr 10 '20 at 02:41

1 Answer


Yes, with the help of the ggwordcloud package. I'll use the babynames dataset as an interesting example to see how the 5 most common baby names have changed over 100 years. First, load the required packages and load the data.

library(babynames)   # Data
library(dplyr)       # Data management
library(ggplot2)     # Graph framework
library(ggwordcloud) # Wordcloud using ggplot
library(gganimate)   # Animation

data(babynames)

The next command finds the top 5 names for each sex in 1915 and in 2015.

babies <- babynames %>%
  filter(year %in% c(1915, 2015)) %>%
  group_by(name, sex, year) %>%
  summarise(n=sum(n)) %>%
  arrange(desc(n)) %>%
  group_by(year, sex) %>%
  top_n(n=5) %>%        # Top 5 by n (the last column) within each year and sex
  ungroup() %>%
  select(name, sex) %>%
  distinct()            # William (M) is in the top 5 in both years; keep him once

Halting the pipeline after top_n() shows which names were returned, before the years and frequencies are dropped:

# A tibble: 20 x 4
# Groups:   sex, year [4]
   name     sex    year     n
   <chr>    <chr> <dbl> <int>
 1 Mary     F      1915 58187
 2 John     M      1915 47577
 3 William  M      1915 38564
 4 James    M      1915 33776
 5 Helen    F      1915 30866
 6 Robert   M      1915 28738
 7 Dorothy  F      1915 25154
 8 Margaret F      1915 23054
 9 Joseph   M      1915 23052
10 Ruth     F      1915 21878
11 Emma     F      2015 20435
12 Olivia   F      2015 19669
13 Noah     M      2015 19613
14 Liam     M      2015 18355
15 Sophia   F      2015 17402
16 Mason    M      2015 16610
17 Ava      F      2015 16361
18 Jacob    M      2015 15938
19 William  M      2015 15889
20 Isabella F      2015 15594

Only the names and sexes are kept, because I want to merge them with the original data to get each name's frequency for every fifth year between 1915 and 2015. I don't use every year because that takes too long to plot.
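As a side note, the top-5-per-group step can also be written with dplyr's slice_max(), which names the ranking column explicitly instead of relying on top_n() picking the last column. This is only a sketch on a small hand-typed subset of the 1915 rows shown above; the `toy` and `top2` names are made up for illustration, and it keeps the top 2 rather than the top 5 to stay short.

```r
library(dplyr)

# Hand-typed subset of the 1915 counts listed above (illustration only)
toy <- data.frame(
  name = c("Mary", "Helen", "Dorothy", "John", "William", "James"),
  sex  = c("F", "F", "F", "M", "M", "M"),
  n    = c(58187L, 30866L, 25154L, 47577L, 38564L, 33776L),
  stringsAsFactors = FALSE
)

# Keep the 2 most frequent names within each sex
top2 <- toy %>%
  group_by(sex) %>%
  slice_max(order_by = n, n = 2) %>%
  ungroup()
```

On the full data the same idea would be `group_by(year, sex) %>% slice_max(order_by = n, n = 5)`.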

Here's the join.

babyyears <- babynames %>%
  inner_join(babies, by=c("name","sex")) %>%
  filter(year>=1915 & year %% 5 == 0) %>%  # Keep all years if you like
  mutate(year=as.integer(year))  # For animation. Not sure why this is required.

So that's just setting up the data for the plot. If we just wanted a static wordcloud, we'd aggregate on the year. But we keep the years for the animation.
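For comparison, here is a rough sketch of the static version mentioned above: sum the counts over the years, then plot one cloud. The `toy` data frame is a hand-typed stand-in for babyyears built from a few counts in the table earlier in this answer, and `static`, `p`, the seed value, and the `max_size` value are arbitrary choices made for illustration.

```r
library(dplyr)
library(ggplot2)
library(ggwordcloud)

# Small stand-in for babyyears (a few counts copied from the table above)
toy <- data.frame(
  name = c("Mary", "William", "William", "Emma"),
  sex  = c("F", "M", "M", "F"),
  year = c(1915L, 1915L, 2015L, 2015L),
  n    = c(58187L, 38564L, 15889L, 20435L),
  stringsAsFactors = FALSE
)

# Collapse the years so each name has a single total frequency
static <- toy %>%
  group_by(name, sex) %>%
  summarise(n = sum(n), .groups = "drop")

# One word cloud; the seed makes the random placement repeatable
p <- ggplot(static, aes(label = name, size = n)) +
  geom_text_wordcloud(seed = 42) +
  scale_size_area(max_size = 20) +
  theme_classic()
```

Calling `print(p)` draws the cloud.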

For plotting, we use ggplot with the geom_text_wordcloud function.

gg <- babyyears %>%
  ggplot(aes(label = name, size=n)) +
  geom_text_wordcloud() +
  theme_classic()

Then transition through the years.

gg2 <- gg + transition_time(year) +
  labs(title = 'Year: {frame_time}')

I like to add a pause at the end, otherwise the animation rolls around to the start immediately after finishing.

animate(gg2, end_pause=30)
anim_save("gg_anim_wc.gif")
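The pause is one of several timing arguments that animate() accepts; `nframes` and `fps` together set how long the animation plays. The sketch below uses made-up values, and `render_slow` is a hypothetical helper name rather than part of gganimate.

```r
# Hypothetical timing values; tune them to taste
n_frames  <- 300   # gganimate's default is 100
fps       <- 10    # frames per second (also the default)
end_pause <- 30    # repeat the final frame, as in the answer

# Playback length in seconds is frames divided by frame rate
duration_seconds <- n_frames / fps

# Wrapper so the settings live in one place; `plot` would be gg2 from above
render_slow <- function(plot) {
  gganimate::animate(plot, nframes = n_frames, fps = fps, end_pause = end_pause)
}
```

It could be used as `anim <- render_slow(gg2)` followed by `anim_save("gg_anim_wc.gif", anim)`.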


It's hard to keep track of all the names (especially the boys) with them all being placed in random locations. Maybe slowing it down will help. But the name that stands out the most from this graphic is "Mary", which was the most common name in 1915 but then slowly started to lose popularity towards the latter half of the century.
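To apply the same approach to text data like that in the question, the documents first need to be reduced to one row per year and word with a count. The following is a minimal sketch that assumes a data frame with `year` and `text` columns; the `docs` contents are invented, and the tokenizer is deliberately naive (lowercase, split on anything that is not a letter a-z, no stop-word removal).

```r
library(dplyr)

# Invented example documents, one row per text with its year
docs <- data.frame(
  year = c(2018L, 2018L, 2019L),
  text = c("The cat sat.", "The cat ran!", "A dog ran."),
  stringsAsFactors = FALSE
)

# Naive tokenizer: lowercase, then split on runs of non-letters
tokens <- strsplit(tolower(docs$text), "[^a-z]+")

# One row per (year, word) with its frequency n
word_years <- data.frame(
  year = rep(docs$year, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
) %>%
  filter(nzchar(word)) %>%
  count(year, word, name = "n")
```

`word_years` then plays the role of babyyears: map `label = word` and `size = n` in aes() and keep `transition_time(year)`.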

Edward
  • Neat. Is there any way to constrain the placement of each specific name, so they don't jump around between years, only get larger/smaller? Also, it could be better to put female names on (say) the left hemisphere and male on the right. – smci Apr 10 '20 at 08:48
  • Yeah - that's the thing I couldn't figure out. There's a `seed` argument in the function to control randomness, but it doesn't seem to be able to fix the words. Maybe someone can figure it out... – Edward Apr 10 '20 at 08:50
  • Is it possible to save it as a video instead of a gif? – Julien Aug 20 '22 at 09:09
  • In my case, I get several different images (frames) for the same level of the factor (here it's `Year`). Why is that? – Julien Aug 20 '22 at 12:17