6

I have a 5845*1095 (rows*columns) data frame that looks like this:

 9  286593   C     C/C     C/A     A/A
 9  334337   A     A/A     G/A     A/A
 9  390512   C     C/C     C/C     C/C

c <-  c("9", "286593", "C", "C/C", "C/A", "A/A") 
d <-  c("9", "334337", "A", "A/A", "G/A", "A/A")
e <-   c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))

I want to use the value in the third column to recode the columns to its right. For example, in row 1 column 3 is "C", so column 4 changes from "C/C" to "0" because both letters match. One matching letter (in either position) becomes "1", and no matching letter becomes "2".

9 286593  C  0  1  2
9 334337  A  0  1  0
9 390512  C  0  0  0 

c <-  c("9", "286593", "C", "0", "1", "2") 
d <-  c("9", "334337", "A", "0", "1", "0")
e <-   c("9", "390512", "C", "0", "0", "0")
dat <- data.frame(rbind(c,d,e))

I am interested in the best way to do this, as I want to get out of the habit of using nested for loops in R.

Brian Tompsett - 汤莱恩
cianius

6 Answers

5

First your data:

c <-  c("9", "286593", "C", "C/C", "C/A", "A/A")
# Note: your original data had a space in "G/A", which I removed.
# If that was not a mistake, we would also have to deal with the space.
d <-  c("9", "334337", "A", "A/A", "G/A", "A/A")
e <-   c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))

Now we generate a vector containing all possible letters.

values <- c("A", "C", "G", "T")
# Give the reference column the same factor levels as the lookup table built below,
# so that the two can be compared with == later on.
dat$X3 <- factor(dat$X3, levels=values)

# Generate all possible combinations of two letters
combinations <- expand.grid(f=values, s=values)
combinations <- cbind(combinations, v=with(combinations, paste(f, s, sep='/')))

The main function looks up each genotype of a column in the table of combinations and then compares both of its letters to the reference value in column 3.

compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}
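
As a small worked example of the lookup, here is a sketch that scores a single genotype column. It rebuilds the lookup table so that it runs on its own; the names ref, genotypes, m and score are only illustrative and not part of the original answer.

```r
# Self-contained sketch: rebuild the lookup table and score one column.
values <- c("A", "C", "G", "T")
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

# Illustrative inputs taken from the question's three rows
ref <- factor(c("C", "A", "C"), levels = values)  # reference letters (column 3)
genotypes <- c("C/A", "G/A", "C/C")               # one genotype column (column 5)

m <- match(genotypes, combinations$v)  # row of each genotype in the lookup table
score <- compare(genotypes, ref)       # 2 minus the number of matching letters
print(score)
```

Each genotype is matched once against the 16 possible combinations, and the two letters are then compared as factors rather than by splitting strings.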

Finally, we use apply to run the function on all columns that have to be changed. You will probably want to change the 6 to your actual number of columns.

dat[,4:6] <- apply(dat[,4:6], 2, compare, val=dat[,3])
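
To avoid hard-coding the last column, the range can be derived from ncol(dat). The following is a minimal sketch under the assumption that every column from the fourth onwards holds genotypes; geno_cols is a name introduced here for illustration.

```r
# Self-contained sketch: same approach, but the column range follows the data.
values <- c("A", "C", "G", "T")
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

dat <- data.frame(rbind(c = c("9", "286593", "C", "C/C", "C/A", "A/A"),
                        d = c("9", "334337", "A", "A/A", "G/A", "A/A"),
                        e = c("9", "390512", "C", "C/C", "C/C", "C/C")),
                  stringsAsFactors = FALSE)
dat$X3 <- factor(dat$X3, levels = values)

# Assumption: all columns from the 4th to the last are genotype columns
geno_cols <- 4:ncol(dat)
dat[, geno_cols] <- apply(dat[, geno_cols], 2, compare, val = dat[, 3])
print(dat)
```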

Note that, unlike the other solutions so far, this one does not use string comparison but relies purely on factor levels. It would be interesting to see which one performs better.

Edit

I just did some benchmarking:

    test replications elapsed relative user.self sys.self user.child sys.child
1   arun      1000000   2.881    1.116     2.864    0.024          0         0
2  fabio      1000000   2.593    1.005     2.558    0.030          0         0
3 roland      1000000   2.727    1.057     2.687    0.048          0         0
5  thilo      1000000   2.581    1.000     2.540    0.036          0         0
4  tyler      1000000   2.663    1.032     2.626    0.042          0         0

This leaves my version slightly faster. However, the difference is close to nothing, so you are probably fine with any of these approaches. To be fair, I did not benchmark the part where I add the additional factor levels; including it would probably rule my version out.
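
The exact benchmarking setup is not shown, so the following is only a rough sketch of how such a comparison could be timed with base R's system.time. It assumes the three example rows are resampled to 10000 rows (as mentioned in a comment on this answer), and compare_strings is a hypothetical string-based helper written here for contrast, not any of the other answers verbatim.

```r
# Sketch only: data size, seed, and the helper compare_strings are assumptions.
set.seed(42)
values <- c("A", "C", "G", "T")
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

# Factor-based scoring, as in this answer
compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

# String-based scoring for comparison (hypothetical helper)
compare_strings <- function(col, val) {
    (substr(col, 1, 1) != val) + (substr(col, 3, 3) != val)
}

small <- data.frame(rbind(c("9", "286593", "C", "C/C", "C/A", "A/A"),
                          c("9", "334337", "A", "A/A", "G/A", "A/A"),
                          c("9", "390512", "C", "C/C", "C/C", "C/C")),
                    stringsAsFactors = FALSE)

# Resample the three rows up to 10000 rows
big <- small[sample(nrow(small), 10000, replace = TRUE), ]
ref_factor <- factor(big$X3, levels = values)
ref_string <- big$X3

t_factor <- system.time(
    res_factor <- apply(big[, 4:6], 2, compare, val = ref_factor))
t_string <- system.time(
    res_string <- apply(big[, 4:6], 2, compare_strings, val = ref_string))

print(t_factor)
print(t_string)
```

Timings from a single run like this will vary between machines and R versions, so they should be read as indicative only.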

Thilo
  • a benchmark on large size is likely much more informative than one on large number of replications – eddi Jun 24 '13 at 20:22
  • that said I'm pretty sure this is the fastest solution for basically any size data – eddi Jun 24 '13 at 20:31
  • @eddi: Oh, I forgot to mention: Before replicating the data, I resampled the three rows such that `dat` contain 10000 rows of data. Thus, there was also a large size of data. – Thilo Jun 25 '13 at 06:18
  • 1
    Your answer gets accepted (even though there's something I can learn from all of them) because of the benchmarking showing yours is the fastest. SO makes me love programming even more. – cianius Jun 25 '13 at 09:05
  • Wow! Nice work (and I'm the second fastest!)! Could you add eddi's solution? The number of SNPs is easily in the order of hundreds of thousands so having a fast approach is great! – Fabio Marroni Jun 25 '13 at 09:31
4

Here is one approach:

FUN <- function(x) {
    a <- strsplit(as.character(unlist(x[-1])), "/")
    b <- sapply(a, function(y) sum(y %in% as.character(unlist(x[1]))))
    2 - b
}

dat[4:6] <-  t(apply(dat[, 3:6], 1, FUN))

## > dat
##   X1     X2 X3 X4 X5 X6
## c  9 286593  C  0  1  2
## d  9 334337  A  0  1  0
## e  9 390512  C  0  0  0
Tyler Rinker
4

Here's one way using apply:

out <- apply(dat[, -(1:2)], 1, function(x) 
        2 - grepl(x[1], x[-1]) -  
        x[-1] %in% paste(x[1], x[1], sep="/"))
cbind(dat[, (1:3)], t(out))
Arun
3

This solution is not very efficient:

dat <-  cbind(dat[,-(4:6)],
              t(sapply(seq_len(nrow(dat)),function(i){
                res <- dat[i,]
                res[,4:6] <- lapply(res[,4:6],function(x) 2-sum(gregexpr(res[,3],x)[[1]]>0))
              })))

#  X1     X2 X3 X4 X5 X6
#c  9 286593  C  0  1  2
#d  9 334337  A  0  1  0
#e  9 390512  C  0  0  0
Roland
2

Ugly, but it works!

fff<-apply(dat[,4:ncol(dat)],2,substr,1,1)!=dat[,3]
ggg<-apply(dat[,4:ncol(dat)],2,substr,3,3)!=dat[,3]
final<-fff+ggg
cbind(dat,final)
  X1     X2 X3  X4  X5  X6 X4 X5 X6
c  9 286593  C C/C C/A A/A  0  1  2
d  9 334337  A A/A G/A A/A  0  1  0
e  9 390512  C C/C C/C C/C  0  0  0
Fabio Marroni
2

Another contribution to R-golf:

cbind(dat[, 1:3],
      apply(dat[, -(1:3)], 2, function(x) {
        2 - (dat[[3]] == gsub('..$', '', x)) - (dat[[3]] == gsub('^..', '', x))
      }))
eddi