long to wide format aggregate R tidyverse

Question

Hi given the following dataframe

library(tidyverse)

df <- data.frame(READS=rep(c('READa', 'READb', 'READc'),each=3) ,GENE=rep(c('GENEa', 'GENEb', 'GENEc'), each=3), COMMENT=rep(c('CommentA', 'CommentA', 'CommentA'),each=3))
> df
  READS  GENE  COMMENT
1 READa GENEa CommentA
2 READa GENEa CommentA
3 READa GENEa CommentA
4 READb GENEb CommentA
5 READb GENEb CommentA
6 READb GENEb CommentA
7 READc GENEc CommentA
8 READc GENEc CommentA
9 READc GENEc CommentA

I want to convert from long to wide format aggregating by Gene Column so that i get the following

         GENEa   GENEb  GENEc
READSa     3        3     3 
READSb     3        3     3

I have tried with no success:

 library(tidyverse)
      df %>% 
      group_by(GENE) %>% 
      select(-COMMENT) %>%
      spread(READS)

Note that the original dataframe is huge so any optimized code would help.

Thanks for your help.

Why `READsa` and `GENEc` is `3`? No such combination exists. — www, Dec 13 '19 at 14:00

score 2 · Accepted Answer · answered Dec 13 '19 at 14:03

Not really sure how you get the 3 count for GENEa and READSb, but assuming you want the count, you can try the following:


library(tidyverse)

df <- tibble(
  READS = rep(c("READa", "READb", "READc"), each = 3), 
  GENE = rep(c("GENEa", "GENEb", "GENEc"), each = 3), 
  COMMENT = rep(c("CommentA", "CommentA", "CommentA"), each = 3)
)
df
#> # A tibble: 9 x 3
#>   READS GENE  COMMENT 
#>   <chr> <chr> <chr>   
#> 1 READa GENEa CommentA
#> 2 READa GENEa CommentA
#> 3 READa GENEa CommentA
#> 4 READb GENEb CommentA
#> 5 READb GENEb CommentA
#> 6 READb GENEb CommentA
#> 7 READc GENEc CommentA
#> 8 READc GENEc CommentA
#> 9 READc GENEc CommentA

df %>%
  count(READS, GENE) %>%
  pivot_wider(
    names_from = GENE, values_from = n,
    values_fill = list(n = 0)
  )
#> # A tibble: 3 x 4
#>   READS GENEa GENEb GENEc
#>   <chr> <int> <int> <int>
#> 1 READa     3     0     0
#> 2 READb     0     3     0
#> 3 READc     0     0     3

^{Created on 2019-12-13 by the reprex package (v0.3.0)}

G. Grothendieck · Answer 2 · 2019-12-13T14:53:02.990

Assuming that you intended that the number in each output cell is the number of rows in the input that have that cell's row and column name then this is a one-liner in base R.

table(df[1:2])

giving this table class object:

       GENE
READS   GENEa GENEb GENEc
  READa     3     0     0
  READb     0     3     0
  READc     0     0     3

If you want the result as a data frame then:

as.data.frame.matrix(table(df[1:2]))

score 1 · Answer 3 · answered Dec 13 '19 at 14:18

library(tidyr) #v1.0.0
pivot_wider(df, -COMMENT, names_from = GENE, values_from = GENE, 
                          values_fn = list(GENE = length), values_fill = list(GENE=0))

# A tibble: 3 x 4
  READS GENEa GENEb GENEc
  <fct> <int> <int> <int>
1 READa     3     0     0
2 READb     0     3     0
3 READc     0     0     3

score 1 · Answer 4 · answered Dec 13 '19 at 14:34

1

An option with dcast

library(data.table)
dcast(setDT(df), READS ~ GENE, length)
#   READS GENEa GENEb GENEc
#1: READa     3     0     0
#2: READb     0     3     0
#3: READc     0     0     3

answered Dec 13 '19 at 14:34

akrun

874,273
37
540
662

score 0 · Answer 5 · answered Dec 13 '19 at 14:04

Given that some combinations of your desired output do not exist:

df <- data.frame(READS=rep(c('READa', 'READb', 'READc'),each=3) ,GENE=rep(c('GENEa', 'GENEb', 'GENEc'), each=3), COMMENT=rep(c('CommentA', 'CommentA', 'CommentA'),each=3))

df %>%
  group_by(READS, GENE) %>% 
  summarise(count = n()) %>% 
  spread(key = "GENE", value = "count")

would result in

  READS GENEa GENEb GENEc
1 READa     3    NA    NA
2 READb    NA     3    NA
3 READc    NA    NA     3

Note that spread was deprecated, in the new versions you should use pivot_wider.

long to wide format aggregate R tidyverse

5 Answers5