4

I'm trying to gather data for two different variables spread over several columns each, grouped by two other variables. Here's the problem. I have several genes, several samples. Each sample has three different possible genotypes, each with an associated frequency. I want to tidy this to get a single column for gene, sample, genotype, frequency.

I have a hackjob solution to this that involves creating listcolumns, spreading those, then extracting the columns with purrr::map functions. It's ugly, not really scalable, and the frequency gets converted to a character before it gets converted back to numeric, not ideal.

Is there a better way to solve this problem?


library(tidyverse) 
# or, separately load dplyr, tibble, tidyr, purrr

# Here's what I have
have <- data_frame(gene=rep(c("gX", "gY"), each=2),
                   sample=rep(c("s1", "s2"), 2),
                   genotype1=c("AA", "AA", "GG", "GG"),
                   genotype2=c("AC", "AC", "GT", "GT"),
                   genotype3=c("CC", "CC", "TT", "TT"),
                   freq1=c(.8,.9, .7, .6),
                   freq2=c(.15,.1, .2, .35),
                   freq3=c(.05,0, .1, .05))
have
#> # A tibble: 4 × 8
#>    gene sample genotype1 genotype2 genotype3 freq1 freq2 freq3
#>   <chr>  <chr>     <chr>     <chr>     <chr> <dbl> <dbl> <dbl>
#> 1    gX     s1        AA        AC        CC   0.8  0.15  0.05
#> 2    gX     s2        AA        AC        CC   0.9  0.10  0.00
#> 3    gY     s1        GG        GT        TT   0.7  0.20  0.10
#> 4    gY     s2        GG        GT        TT   0.6  0.35  0.05


# Here's what I want. 
# Do a multicolumn gather grouped by gene and sample
want <- have %>%
  group_by(gene, sample) %>%
  summarize(x1=list(c(genotype=genotype1, freq=freq1)),
            x2=list(c(genotype=genotype2, freq=freq2)),
            x3=list(c(genotype=genotype3, freq=freq3))) %>%
  ungroup() %>%
  gather(key, value, x1, x2, x3) %>%
  mutate(genotype=map_chr(value, "genotype"),
         freq=map_chr(value, "freq") %>% as.numeric) %>%
  select(-key, -value) %>%
  arrange(gene, sample, genotype)
want
#> # A tibble: 12 × 4
#>     gene sample genotype  freq
#>    <chr>  <chr>    <chr> <dbl>
#> 1     gX     s1       AA  0.80
#> 2     gX     s1       AC  0.15
#> 3     gX     s1       CC  0.05
#> 4     gX     s2       AA  0.90
#> 5     gX     s2       AC  0.10
#> 6     gX     s2       CC  0.00
#> 7     gY     s1       GG  0.70
#> 8     gY     s1       GT  0.20
#> 9     gY     s1       TT  0.10
#> 10    gY     s2       GG  0.60
#> 11    gY     s2       GT  0.35
#> 12    gY     s2       TT  0.05
Stephen Turner
  • 2,574
  • 8
  • 31
  • 44
  • 5
    I would simply do `library(data.table) ; melt(setDT(have), id = 1:2, measure = patterns("genotype", "freq"))` but that wasn't developed by Hadley, so you can safely ignore. – David Arenburg Jan 26 '17 at 18:56
  • @DavidArenburg this works beautifully. consider making it an official answer. – Stephen Turner Jan 26 '17 at 19:01

2 Answers2

6

You could use to_long() from the sjmisc-package, which gathers multiple columns at once:

to_long(have, keys = "genos", values = c("genotype", "freq"),
       c("genotype1", "genotype2", "genotype3"),
       c("freq1", "freq2", "freq3"))

##  A tibble: 12 × 5
##     gene sample     genos genotype  freq
##    <chr>  <chr>     <chr>    <chr> <dbl>
## 1     gX     s1 genotype1       AA  0.80
## 2     gX     s2 genotype1       AA  0.90
## 3     gY     s1 genotype1       GG  0.70
## 4     gY     s2 genotype1       GG  0.60
## 5     gX     s1 genotype2       AC  0.15
## 6     gX     s2 genotype2       AC  0.10
## 7     gY     s1 genotype2       GT  0.20
## 8     gY     s2 genotype2       GT  0.35
## 9     gX     s1 genotype3       CC  0.05
## 10    gX     s2 genotype3       CC  0.00
## 11    gY     s1 genotype3       TT  0.10
## 12    gY     s2 genotype3       TT  0.05

to_long() requires the name(s) of the key- and value-columns, followed by multiple column names per vector that should be gathered.

Daniel
  • 7,252
  • 6
  • 26
  • 38
1

Full tidyverse approach:

want <- have %>%
     gather(variable, value, -gene, -sample) %>% 
     mutate(group = parse_number(variable),
            variable = str_extract(variable,"\\D+")) %>% 
     spread(variable, value) %>% 
     select(-group)
Jake Kaupp
  • 7,892
  • 2
  • 26
  • 36
  • Is *tidyverse* a design philosophy or a restricted set of packages? – Daniel Jan 26 '17 at 19:54
  • I would say both, or at least a workflow that focuses on producing readable code via pipes and focusing on tidy data. – Jake Kaupp Jan 27 '17 at 01:33
  • @Daniel Have to agree with Jake - I also would go with both on this. Here is a good place to start http://vita.had.co.nz/papers/tidy-data.pdf or https://vimeo.com/33727555, and a great place to continue could be here: http://r4ds.had.co.nz/ – leerssej Jan 27 '17 at 01:34
  • Yes, I know these resources. I personally think that _tidyverse_ is more a phoilosophy (see also https://github.com/tidyverse/tidyverse/blob/master/vignettes/manifesto.Rmd), and not related to a specific author (Hadley in this case). The `to_long()`-function follows the same design principles as `tidyr::gather()`, so why would this not fit into a "tidyverse approach"? I think my point is about the same thing @David-Arenburg mentioned: "It wasn't developed by Hadley, so you can safely ignore". ;-) – Daniel Jan 27 '17 at 10:32
  • 1
    Some people like an all-in-one approach that limits amount of pieces and packages they have to learn. The `to_long()` would definitely fit within the tidy data tools and philosophy. My answer was just limited to only using the `tidyverse` collection of packages, rather than to spark silly package/author discussions – Jake Kaupp Jan 27 '17 at 13:18