-2

Hi given the following dataframe

library(tidyverse)

df <- data.frame(READS=rep(c('READa', 'READb', 'READc'),each=3) ,GENE=rep(c('GENEa', 'GENEb', 'GENEc'), each=3), COMMENT=rep(c('CommentA', 'CommentA', 'CommentA'),each=3))
> df
  READS  GENE  COMMENT
1 READa GENEa CommentA
2 READa GENEa CommentA
3 READa GENEa CommentA
4 READb GENEb CommentA
5 READb GENEb CommentA
6 READb GENEb CommentA
7 READc GENEc CommentA
8 READc GENEc CommentA
9 READc GENEc CommentA

I want to convert from long to wide format aggregating by Gene Column so that i get the following

         GENEa   GENEb  GENEc
READSa     3        3     3 
READSb     3        3     3

I have tried with no success:

 library(tidyverse)
      df %>% 
      group_by(GENE) %>% 
      select(-COMMENT) %>%
      spread(READS) 

Note that the original dataframe is huge so any optimized code would help.

Thanks for your help.

david
  • 805
  • 1
  • 9
  • 21

5 Answers5

2

Not really sure how you get the 3 count for GENEa and READSb, but assuming you want the count, you can try the following:


library(tidyverse)

df <- tibble(
  READS = rep(c("READa", "READb", "READc"), each = 3), 
  GENE = rep(c("GENEa", "GENEb", "GENEc"), each = 3), 
  COMMENT = rep(c("CommentA", "CommentA", "CommentA"), each = 3)
)
df
#> # A tibble: 9 x 3
#>   READS GENE  COMMENT 
#>   <chr> <chr> <chr>   
#> 1 READa GENEa CommentA
#> 2 READa GENEa CommentA
#> 3 READa GENEa CommentA
#> 4 READb GENEb CommentA
#> 5 READb GENEb CommentA
#> 6 READb GENEb CommentA
#> 7 READc GENEc CommentA
#> 8 READc GENEc CommentA
#> 9 READc GENEc CommentA

df %>%
  count(READS, GENE) %>%
  pivot_wider(
    names_from = GENE, values_from = n,
    values_fill = list(n = 0)
  )
#> # A tibble: 3 x 4
#>   READS GENEa GENEb GENEc
#>   <chr> <int> <int> <int>
#> 1 READa     3     0     0
#> 2 READb     0     3     0
#> 3 READc     0     0     3

Created on 2019-12-13 by the reprex package (v0.3.0)

David
  • 9,216
  • 4
  • 45
  • 78
2

Assuming that you intended that the number in each output cell is the number of rows in the input that have that cell's row and column name then this is a one-liner in base R.

table(df[1:2])

giving this table class object:

       GENE
READS   GENEa GENEb GENEc
  READa     3     0     0
  READb     0     3     0
  READc     0     0     3

If you want the result as a data frame then:

as.data.frame.matrix(table(df[1:2]))
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
1
library(tidyr) #v1.0.0
pivot_wider(df, -COMMENT, names_from = GENE, values_from = GENE, 
                          values_fn = list(GENE = length), values_fill = list(GENE=0))

# A tibble: 3 x 4
  READS GENEa GENEb GENEc
  <fct> <int> <int> <int>
1 READa     3     0     0
2 READb     0     3     0
3 READc     0     0     3
A. Suliman
  • 12,923
  • 5
  • 24
  • 37
1

An option with dcast

library(data.table)
dcast(setDT(df), READS ~ GENE, length)
#   READS GENEa GENEb GENEc
#1: READa     3     0     0
#2: READb     0     3     0
#3: READc     0     0     3
akrun
  • 874,273
  • 37
  • 540
  • 662
0

Given that some combinations of your desired output do not exist:

df <- data.frame(READS=rep(c('READa', 'READb', 'READc'),each=3) ,GENE=rep(c('GENEa', 'GENEb', 'GENEc'), each=3), COMMENT=rep(c('CommentA', 'CommentA', 'CommentA'),each=3))

df %>%
  group_by(READS, GENE) %>% 
  summarise(count = n()) %>% 
  spread(key = "GENE", value = "count") 

would result in

  READS GENEa GENEb GENEc
1 READa     3    NA    NA
2 READb    NA     3    NA
3 READc    NA    NA     3

Note that spread was deprecated, in the new versions you should use pivot_wider.

Annet
  • 846
  • 3
  • 14