Recode dataframe based on one column

Question

I have a 5845*1095 (rows*columns) data frame that looks like this:

 9  286593   C     C/C     C/A     A/A
 9  334337   A     A/A     G/A     A/A
 9  390512   C     C/C     C/C     C/C

c <-  c("9", "286593", "C", "C/C", "C/A", "A/A") 
d <-  c("9", "334337", "A", "A/A", "G/A", "A/A")
e <-   c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))

I want to recode every column to the right of the third column according to the letter in column 3. For example, in row 1 column 3 is "C", so column 4 is changed from "C/C" to "0" because both letters match. One matching letter (first or second) becomes "1", and no matching letter becomes "2".

9 286593  C  0  1  2
9 334337  A  0  1  0
9 390512  C  0  0  0 

c <-  c("9", "286593", "C", "0", "1", "2") 
d <-  c("9", "334337", "A", "0", "1", "0")
e <-   c("9", "390512", "C", "0", "0", "0")
dat <- data.frame(rbind(c,d,e))

I would like to see the best way to do this, as I want to get out of the habit of using nested for loops in R.

Are `A,C,G,T` the only letters? Or do you also have `N` and other letters? — Arun, Jun 24 '13 at 14:59
It should just be A,C,G,T yes. You can tell it's DNA can't you :) — cianius, Jun 24 '13 at 15:03
As a bioinformatician myself, it'd be a problem if I didn't :). — Arun, Jun 24 '13 at 15:05
Just a tip towards a possible solution: `length(grep(dat$X3,dat$X4))` — zx8754, Jun 24 '13 at 15:12
I hoped that with no missing data would have been easier, but no... I'm trying but no solution yet :-( — Fabio Marroni, Jun 24 '13 at 15:14
@TylerRinker, the example data is different from the data in the `data.frame`. The actual value there is `A/A` and they are different from `C` both before and after the `/`. — Arun, Jun 24 '13 at 15:22
A lot of nice solutions! Thank you all! The step of computing identity from a dataset like the one proposed by pepsimax is very common in genetics, and your help is greatly appreciated. — Fabio Marroni, Jun 24 '13 at 15:54

Thilo · Accepted Answer · 2013-06-24T15:54:14.810

First your data:

c <-  c("9", "286593", "C", "C/C", "C/A", "A/A")
# Note: your original data had a space in "G/A", which I removed.
# If that space was intentional, we would also have to handle it.
d <-  c("9", "334337", "A", "A/A", "G/A", "A/A")
e <-   c("9", "390512", "C", "C/C", "C/C", "C/C")
dat <- data.frame(rbind(c,d,e))

Now we create a vector holding all the possible letters.

values <- c("A", "C", "G", "T")
# Give the reference column the same factor levels as the generated
# combinations, so that the two can be compared later on.
dat$X3 <- factor(dat$X3, levels=values)

# Generate all possible combinations of two letters
combinations <- expand.grid(f=values, s=values)
combinations <- cbind(combinations, v=with(combinations, paste(f, s, sep='/')))

The main function looks up each genotype of a column in the combinations table and then compares its first and second letter to the reference in column 3.

compare <- function(col, val) {
    m <- match(col, combinations$v)
    2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

Finally, we use apply to run the function on all columns that have to be changed. You will probably want to replace the 6 with your actual number of columns.

dat[,4:6] <- apply(dat[,4:6], 2, compare, val=dat[,3])
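
For reference, the steps above can be run end to end as follows. This is only a sketch under a few assumptions: the columns are read as character (`stringsAsFactors = FALSE`), the reference is kept in a separate factor `ref` instead of overwriting `dat$X3`, and `lapply` over `4:ncol(dat)` replaces the hard-coded `4:6`. The names `ref`, `geno_cols` and `recoded` are illustrative and not part of the original answer.

```r
# Example data from the question, kept as character columns
dat <- data.frame(rbind(c = c("9", "286593", "C", "C/C", "C/A", "A/A"),
                        d = c("9", "334337", "A", "A/A", "G/A", "A/A"),
                        e = c("9", "390512", "C", "C/C", "C/C", "C/C")),
                  stringsAsFactors = FALSE)

values <- c("A", "C", "G", "T")
ref <- factor(dat[[3]], levels = values)   # reference allele with fixed levels

# Lookup table of all 16 genotypes; f and s are factors with levels A, C, G, T
combinations <- expand.grid(f = values, s = values)
combinations$v <- paste(combinations$f, combinations$s, sep = "/")

compare <- function(col, val) {
  m <- match(col, combinations$v)          # row of each genotype in the table
  2 - (combinations$f[m] == val) - (combinations$s[m] == val)
}

geno_cols <- 4:ncol(dat)                   # every column right of the reference
recoded <- dat
recoded[geno_cols] <- lapply(dat[geno_cols], compare, val = ref)
print(recoded)
```

Working column by column with `lapply` keeps each result a numeric vector, so the data frame never has to be converted to a character matrix first.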

Note that, unlike the other solutions posted so far, this one does not rely on string comparison but purely on factor levels. It would be interesting to see which one performs better.

Edit

I just did some benchmarking:

    test replications elapsed relative user.self sys.self user.child sys.child
1   arun      1000000   2.881    1.116     2.864    0.024          0         0
2  fabio      1000000   2.593    1.005     2.558    0.030          0         0
3 roland      1000000   2.727    1.057     2.687    0.048          0         0
5  thilo      1000000   2.581    1.000     2.540    0.036          0         0
4  tyler      1000000   2.663    1.032     2.626    0.042          0         0

which leaves my version slightly faster. However, the difference is negligible, so you are probably fine with any of these approaches. And to be fair: I did not benchmark the part where I add the additional factor levels. Including it would probably rule my version out.
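
A comparison on a larger data set could be sketched like this with base R only. It is not the benchmark that produced the table above: the 10,000 resampled rows, the seed, and the wrapper names `by_gsub` and `by_split` are assumptions made for illustration, and the two wrappers are only loosely modelled on two of the answers in this thread. Timings depend on the machine, so none are shown.

```r
small <- data.frame(rbind(c("9", "286593", "C", "C/C", "C/A", "A/A"),
                          c("9", "334337", "A", "A/A", "G/A", "A/A"),
                          c("9", "390512", "C", "C/C", "C/C", "C/C")),
                    stringsAsFactors = FALSE)

# Resample the three example rows to get a larger test set
set.seed(1)
big <- small[sample(nrow(small), 10000, replace = TRUE), ]
rownames(big) <- NULL

# Column-wise approach: strip two characters to get the first / second letter
by_gsub <- function(d) {
  sapply(d[4:ncol(d)], function(x) {
    2 - (d[[3]] == gsub("..$", "", x)) - (d[[3]] == gsub("^..", "", x))
  })
}

# Row-wise approach: split each genotype and count matches with the reference
by_split <- function(d) {
  t(apply(d[, 3:ncol(d)], 1, function(x) {
    2 - sapply(strsplit(x[-1], "/"), function(y) sum(y %in% x[1]))
  }))
}

# Both approaches must agree before their timings are worth comparing
stopifnot(all(by_gsub(big) == by_split(big)))

print(system.time(by_gsub(big)))
print(system.time(by_split(big)))
```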

a benchmark on large size is likely much more informative than one on large number of replications — eddi, Jun 24 '13 at 20:22
that said I'm pretty sure this is the fastest solution for basically any size data — eddi, Jun 24 '13 at 20:31
@eddi: Oh, I forgot to mention: Before replicating the data, I resampled the three rows such that `dat` contains 10000 rows of data. Thus, there was also a large size of data. — Thilo, Jun 25 '13 at 06:18
Your answer gets accepted (even though there's something I can learn from all of them) because of the benchmarking showing yours is the fastest. SO makes me love programming even more. — cianius, Jun 25 '13 at 09:05
Wow! Nice work (and I'm the second fastest!)! Could you add eddi's solution? The number of SNPs is easily in the order of hundreds of thousands so having a fast approach is great! — Fabio Marroni, Jun 25 '13 at 09:31

score 4 · Answer 2 · answered Jun 24 '13 at 15:25

Here is one approach:

FUN <- function(x) {
    a <- strsplit(as.character(unlist(x[-1])), "/")
    b <- sapply(a, function(y) sum(y %in% as.character(unlist(x[1]))))
    2 - b
}

dat[4:6] <-  t(apply(dat[, 3:6], 1, FUN))

## > dat
##   X1     X2 X3 X4 X5 X6
## c  9 286593  C  0  1  2
## d  9 334337  A  0  1  0
## e  9 390512  C  0  0  0
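
To see what `FUN` does, it may help to step through a single row by hand. The sketch below assumes the row arrives as a named character vector, which is how `apply` passes it; the variable names are illustrative.

```r
# Row 1 of dat[, 3:6] as apply() would hand it to FUN
x <- c(X3 = "C", X4 = "C/C", X5 = "C/A", X6 = "A/A")

alleles <- strsplit(x[-1], "/")   # one pair of letters per genotype
matches <- sapply(alleles, function(y) sum(y %in% x[1]))  # letters equal to the reference
score <- 2 - matches              # 0 = both match, 1 = one matches, 2 = none
print(score)
```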

Arun · Answer 3 · score 4 · 2013-06-24T15:44:52.553

Here's one way using apply:

out <- apply(dat[, -(1:2)], 1, function(x) 
        2 - grepl(x[1], x[-1]) -  
        x[-1] %in% paste(x[1], x[1], sep="/"))
cbind(dat[, (1:3)], t(out))
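
The arithmetic inside the anonymous function can be checked on one row. In this sketch `ref` and `genotypes` are illustrative names, and `fixed = TRUE` is added only to make explicit that the letter is matched literally: `grepl` subtracts one when the reference letter occurs at all, and `%in%` subtracts a second one when the genotype is homozygous for it.

```r
ref <- "C"
genotypes <- c("C/C", "C/A", "A/A")

has_ref    <- grepl(ref, genotypes, fixed = TRUE)         # TRUE  TRUE  FALSE
is_hom_ref <- genotypes %in% paste(ref, ref, sep = "/")   # TRUE  FALSE FALSE

score <- 2 - has_ref - is_hom_ref
print(score)   # 0 1 2
```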

edited Jun 24 '13 at 15:44

answered Jun 24 '13 at 15:31

score 3 · Answer 4 · answered Jun 24 '13 at 15:26

This solution is not very efficient:

dat <-  cbind(dat[,-(4:6)],
              t(sapply(seq_len(nrow(dat)),function(i){
                res <- dat[i,]
                res[,4:6] <- lapply(res[,4:6],function(x) 2-sum(gregexpr(res[,3],x)[[1]]>0))
              })))

#  X1     X2 X3 X4 X5 X6
#c  9 286593  C  0  1  2
#d  9 334337  A  0  1  0
#e  9 390512  C  0  0  0

Fabio Marroni · Answer 5 · score 2 · 2013-06-24T15:39:10.420

Ugly, but it works!

fff<-apply(dat[,4:ncol(dat)],2,substr,1,1)!=dat[,3]
ggg<-apply(dat[,4:ncol(dat)],2,substr,3,3)!=dat[,3]
final<-fff+ggg
cbind(dat,final)
#  X1     X2 X3  X4  X5  X6 X4 X5 X6
#c  9 286593  C C/C C/A A/A  0  1  2
#d  9 334337  A A/A G/A A/A  0  1  0
#e  9 390512  C C/C C/C C/C  0  0  0
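
Because `substr` is vectorised and keeps the dimensions of a matrix, the same idea can also be written without `apply`, overwriting the genotype columns instead of appending new ones. This is a sketch of a variant, not the code posted above; `geno`, `geno_cols`, `first_differs` and `second_differs` are illustrative names, and character columns are assumed.

```r
dat <- data.frame(rbind(c = c("9", "286593", "C", "C/C", "C/A", "A/A"),
                        d = c("9", "334337", "A", "A/A", "G/A", "A/A"),
                        e = c("9", "390512", "C", "C/C", "C/C", "C/C")),
                  stringsAsFactors = FALSE)

geno_cols <- 4:ncol(dat)
geno <- as.matrix(dat[geno_cols])            # character matrix of genotypes

# The reference vector is recycled down each column, i.e. compared row by row
first_differs  <- substr(geno, 1, 1) != dat[[3]]
second_differs <- substr(geno, 3, 3) != dat[[3]]

dat[geno_cols] <- first_differs + second_differs   # number of non-matching letters
print(dat)
```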

edited Jun 24 '13 at 15:39

answered Jun 24 '13 at 15:28

eddi · Answer 6 · score 2 · 2013-06-24T20:27:48.750

Another contribution to R-golf:

cbind(dat[, 1:3],
      apply(dat[, -(1:3)], 2, function(x) {
        2 - (dat[[3]] == gsub('..$', '', x)) - (dat[[3]] == gsub('^..', '', x))
      }))
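
The two regular expressions simply drop two characters from one end of each genotype, which leaves the first or the second letter. A small illustration, assuming every genotype has the three-character form `X/Y` (`x` and `ref` are illustrative names):

```r
x <- c("C/C", "C/A", "A/A")
ref <- "C"

first  <- gsub("..$", "", x)   # removes the last two characters:  "C" "C" "A"
second <- gsub("^..", "", x)   # removes the first two characters: "C" "A" "A"

score <- 2 - (ref == first) - (ref == second)
print(score)   # 0 1 2
```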

edited Jun 24 '13 at 20:27

answered Jun 24 '13 at 20:21
