
I'm trying to get good with tidyr. Is there a better way to prep the anscombe dataset for plotting with ggplot2? Specifically, I don't love having to add data (obs_num). How would you do this?

library(tidyverse)
library(datasets)

anscombe %>%
  mutate(obs_num = 1:n()) %>%
  gather(variable, value, -obs_num) %>%
  separate(variable, c("variable", "set"), 1) %>%
  spread(variable, value) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  facet_wrap(~set)
Alex Coppock

1 Answer


I think you need to add the extra column in order to uniquely identify each observation in the call to spread. Hadley discusses this in a comment on this SO question. Another approach would be to separately stack the x and y columns, as in the code below, but I don't see why that would be any better than your version. In fact, it could be worse if there are cases where the x and y values end up out of correspondence:

bind_cols(anscombe %>% select(matches("x")) %>% gather(set, "x"),
          anscombe %>% select(matches("y")) %>% gather(key, "y")) %>%
  select(-key) %>%
  mutate(set = gsub("x", "Set: ", set))

Another option would be to use base reshape, which is more succinct:

anscombe %>% 
  reshape(varying=1:8, direction="long", sep="", timevar="set")
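
If your tidyr is version 1.0.0 or later (an assumption about your setup), `pivot_longer()` can handle paired columns like these without a helper id column. The sketch below uses the special `".value"` name so that the first character of each column name (`x` or `y`) becomes an output column and the second character becomes `set`. The name `anscombe_long` is only illustrative.

```r
library(tidyr)

# Split each column name into two single-character pieces:
#   piece 1 ("x" or "y") -> names the output value columns via ".value"
#   piece 2 ("1".."4")   -> goes into a new column called "set"
anscombe_long <- pivot_longer(
  anscombe,
  cols = everything(),
  names_to = c(".value", "set"),
  names_pattern = "(.)(.)"
)
```

The result has one row per observation per set, with columns `set`, `x`, and `y`, so it should drop straight into the same `ggplot(aes(x = x, y = y)) + ... + facet_wrap(~set)` call from the question.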
eipi10
  • `reshape` is mysterious and powerful! Fantastic one-line solution, and I'm not convinced that the tidyverse solution is any less opaque in this case. – Alex Coppock Oct 19 '16 at 17:50
  • Yes, I find base `reshape` mysterious as well. It would be nice if `tidyr` could similarly deal with multiple pairs of corresponding columns. – eipi10 Oct 19 '16 at 17:52