2

I have relatively tidy data with samples, genes, alleles, and frequencies in separate columns. For each gene and for each sample I need to split out the alleles and their corresponding frequency into separate columns. Here's what I have and what I need.

Trying to do this with dplyr/tidyr, but I'll take any solution I can get.

What I have:

data.frame(sample=rep("sample1", 10), 
           gene=rep(paste0("gene", 1:5), each=2), 
           allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"), 
           freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

#     sample  gene allele freq
# 1  sample1 gene1      A  0.9
# 2  sample1 gene1      G  0.1
# 3  sample1 gene2      A  0.8
# 4  sample1 gene2      C  0.2
# 5  sample1 gene3      A  0.7
# 6  sample1 gene3      T  0.3
# 7  sample1 gene4      C  0.6
# 8  sample1 gene4      G  0.4
# 9  sample1 gene5      G  0.5
# 10 sample1 gene5      T  0.5

What I want:

data.frame(sample=rep("sample1", 5), 
           gene=paste0("gene", 1:5), 
           allele1=c("A", "A", "A", "C", "G"), 
           allele2=c("G", "C", "T", "G", "T"), 
           freq1=c(.9, .8, .7, .6, .5), 
           freq2=c(.1, .2, .3, .4, .5))

#    sample  gene allele1 allele2 freq1 freq2
# 1 sample1 gene1       A       G   0.9   0.1
# 2 sample1 gene2       A       C   0.8   0.2
# 3 sample1 gene3       A       T   0.7   0.3
# 4 sample1 gene4       C       G   0.6   0.4
# 5 sample1 gene5       G       T   0.5   0.5
zx8754
  • 52,746
  • 12
  • 114
  • 209
Stephen Turner
  • 2,574
  • 8
  • 31
  • 44
  • 1
    Please don't include "> " and "+ " at the start of lines; they make it harder to copy-paste the code. – Frank Jul 28 '15 at 21:15

5 Answers5

7

You can use dcast from the devel version of data.tableie. 1.9.5+ which can take multiple value.var columns. We create a sequence column ('indx') grouped by 'sample' and 'gene'. Then dcast from long to wide format mentioning the value.var columns.

 library(data.table)#v1.9.5+ 
 setDT(df)[, indx:=1:.N,.(sample, gene)]
 dcast(df, sample+gene~indx, value.var=c('allele', 'freq'), sep= '')
 #    sample  gene   allele1 allele2  freq1 freq2
 #1: sample1 gene1        A        G    0.9    0.1
 #2: sample1 gene2        A        C    0.8    0.2
 #3: sample1 gene3        A        T    0.7    0.3
 #4: sample1 gene4        C        G    0.6    0.4
 #5: sample1 gene5        G        T    0.5    0.5

NOTE: Instructions to install the devel version are here

The sep='' argument is useful to create the column names as 'allele1', 'allele2' etc. The default would be `allele_1', 'allele_2' etc. (from @Arun's comments)

akrun
  • 874,273
  • 37
  • 540
  • 662
5

This uses summarise instead of true reshaping but might fit the bill.

library(dplyr)
foo <- data.frame(sample=rep("sample1", 10), 
                  gene=rep(paste0("gene", 1:5), each=2), 
                  allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"), 
                  freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

foo %>%
  group_by(sample, gene) %>% 
  summarise(allele1 = first(allele), allele2 = last(allele),
            freq1 = first(freq), freq2 = last(freq))

## Source: local data frame [5 x 6]
## Groups: sample
## 
##    sample  gene allele1 allele2 freq1 freq2
## 1 sample1 gene1       A       G   0.9   0.1
## 2 sample1 gene2       A       C   0.8   0.2
## 3 sample1 gene3       A       T   0.7   0.3
## 4 sample1 gene4       C       G   0.6   0.4
## 5 sample1 gene5       G       T   0.5   0.5
jennybryan
  • 2,606
  • 2
  • 18
  • 33
1

You can join the table on itself and then select the appropriate rows.

library(dplyr)

table <- data.frame(sample=rep("sample1", 10), 
              gene=rep(paste0("gene", 1:5), each=2), 
              allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"), 
              freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

inner_join(table, table, by=c("sample","gene")) %>%
filter(allele.x != allele.y,
      (freq.x > freq.y | (freq.x == freq.y & as.numeric(allele.x) < as.numeric(allele.y))))
0

Provided the data is sorted on sample and gene columns, and every gene has 2 rows, then we could try below:

cbind( df[ seq(1,nrow(df),2), ],
       df[ seq(2,nrow(df),2), -c(1,2) ] ) 

#output
#    sample  gene allele freq allele freq
# 1 sample1 gene1      A  0.9      G  0.1
# 3 sample1 gene2      A  0.8      C  0.2
# 5 sample1 gene3      A  0.7      T  0.3
# 7 sample1 gene4      C  0.6      G  0.4
# 9 sample1 gene5      G  0.5      T  0.5
zx8754
  • 52,746
  • 12
  • 114
  • 209
0

I struggle with this question myself. The classic reshape is arguably the clearest here?

df <- data.frame(sample=rep("sample1", 10), 
           gene=rep(paste0("gene", 1:5), each=2), 
           allele=c("A", "G", "A", "C", "A", "T", "C", "G", "G", "T"), 
           freq=c(.9, .1, .8, .2, .7, .3, .6, .4, .5, .5))

# make a time index
df$time = as.numeric( duplicated( df$gene ) ) # rep( c(1,2), nrow(df)/2 )
# reshape
reshape( df, idvar="gene", v.names=c("allele", "freq"), timevar="time", direction="wide")

But I really want a clean answer from the tidyverse!

miratrix
  • 191
  • 2
  • 12