0

I asked this question a while ago (Recode dataframe based on one column) and the answer worked perfectly. Now however, i almost want to do the reverse. Namely, I have a (700k * 2000) of 0/1/2 or NA. In a separate dataframe I have two columns (Ref and Obs). The 0 corresponds to two instances of Ref, 1 is one instance of Ref and one instance of Obs and 2 is two Obs. To clarify, data snippet:

 Genotype File --- 
 Ref Obs   
 A    G        
 T    C
 G    C
 Ref <- c("A", "T", "G")
 Obs <- c("G", "C", "C")

 Current Data---
 Sample.1     Sample.2  .... Sample.2000
 0              1                2
 0              0                0 
 0              NA               1

mat <- matrix(nrow=3, ncol=3)
mat[,1] <- c(0,0,0)
mat[,2] <- c(1,0,NA)
mat[,3] <- c(2,0,1)

 Desired Data format--- 
 Sample.1   Sample.1   Sample.2   Sample.2   Sample.2000   Sample.2000
    A         A           A           G          G              G
    T         T           T           T          T              T
    G         G           0           0          G              C

I think that's right. The desired data format has two columns (space separated) for each sample. 0 in this format (plink ped file for the bioinformaticians out there) is missing data.

Community
  • 1
  • 1
cianius
  • 2,272
  • 6
  • 28
  • 41
  • Can you provide us with R objects? Consider using `dput`. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example on how to make a reproducible example. – Roman Luštrik Jan 07 '14 at 15:07
  • I would have used dput, but the genotype matrix and the support dataframe are both rather large, at 700 thousand rows, so it's just a mess. I can put it in manually though. – cianius Jan 07 '14 at 15:14
  • 1
    Make a small 5 column, 5 row example. – Roman Luštrik Jan 07 '14 at 15:20

3 Answers3

1

MAJOR ASSUMPTION: your data is in 3 element frames, i.e. you want to apply your mapping to the first 3 rows, then the next 3, and so on, which I think makes sense given DNA frames. If you want a rolling 3 element window this will not work (but code can be modified to make it work). This will work for an arbitrary number of columns, and arbitrary number of 3 row groups:

# Make up a matrix with your properties (4 cols, 6 rows)

col <- 4L
frame <- 3L
mat <- matrix(sample(c(0:2, NA_integer_), 2 * frame * col, replace=T), ncol=col)

# Mapping data

Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
map.base <- cbind(Ref, Obs)
num.to.let <- matrix(c(1, 1, 1, 2, 2, 2), byrow=T, ncol=2) # how many from each of ref obs

# Function to map 0,1,2,NA to Ref/Obs

re_map <- function(mat.small) {  # 3 row matrices, with col columns
  t(
    mapply(                      # iterate through each row in matrix
      function(vals, map, num.to.let) {
        vals.2 <- unlist(lapply(vals, function(x) map[num.to.let[x + 1L, ]]))
        ifelse(is.na(vals.2), 0, vals.2)
      },
      vals=split(mat.small, row(mat.small)),  # a row
      map=split(map.base, row(map.base)),     # the mapping for that row
      MoreArgs=list(num.to.let=num.to.let)    # general conversion of number to Obs/Ref
  ) ) 
}
# Split input data frame into 3 row matrices (assumes frame size 3),
# and apply mapping function to each group

mat.split <- split.data.frame(mat, sort(rep(1:(nrow(mat) / frame), frame)))
mat.res <- do.call(rbind, lapply(mat.split, re_map)) 
colnames(mat.res) <- paste0("Sample.", rep(1:ncol(mat), each=2))
print(mat.res, quote=FALSE)
#   Sample.1 Sample.1 Sample.2 Sample.2 Sample.3 Sample.3 Sample.4 Sample.4
# 1 G        G        A        G        G        G        G        G       
# 2 C        C        0        0        T        C        T        C       
# 3 0        0        G        C        G        G        G        G       
# 1 A        A        A        A        A        G        A        A       
# 2 C        C        C        C        T        C        C        C       
# 3 C        C        G        G        0        0        0        0       
BrodieG
  • 51,669
  • 9
  • 93
  • 146
0

I am not sure but this could be what you need:

first same simple data

geno <- data.frame(Ref = c("A", "T", "G"), Obs = c("G", "C", "C"))
data <- data.frame(s1 = c(0,0,0),s2 = c(1, 0, NA))

then a couple of functions:

f <- function(i , x, geno){
  x <- x[i]
  if(!is.na(x)){
    if (x == 0) {y <- geno[i , c(1,1)]}
    if (x == 1) {y <- geno[i, c(1,2)]}
    if (x == 2) {y <- geno[i, c(2,2)]}
  }
  else y <- c(0,0)
  names(y) <- c("s1", "s2")
  y
}

g <- function(x, geno){
 Reduce(rbind, lapply(1:length(x), FUN = f , x = x, geno = geno))
}

The way f() is defined may not be the most elegant but it does the job

Then simply run it as a doble for loop in a lapply fashion

as.data.frame(Reduce(cbind, lapply(data , g , geno = geno )))

hope it helps

Andrea
  • 593
  • 2
  • 8
0

Here's one way based on the sample data in your answer:

# create index
idx <- lapply(data, function(x) cbind((x > 1) + 1, (x > 0) + 1))

# list of matrices
lst <- lapply(idx, function(x) {
  tmp <- apply(x, 2, function(y) geno[cbind(seq_along(y), y)])
  replace(tmp, is.na(tmp), 0)
  })

# one data frame
as.data.frame(lst)

#   s1.1 s1.2 s2.1 s2.2
# 1    A    A    A    G
# 2    T    T    T    T
# 3    G    G    0    0
Sven Hohenstein
  • 80,497
  • 17
  • 145
  • 168