In an attempt to replace mismatches between the two data frames below, I have already managed to create a new data frame in which the mismatches are replaced. I am now looking for a more efficient way to do this using ifelse or the data.table package:

dfA <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "CA"), animal3 = c("AA", "TT", "AG", "CA")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfA
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TT
# snp3      AG      AG      AG
# snp4      CA      CA      CA
dfB <- structure(list(animal1 = c("AA", "TT", "AG", "CA"), animal2 = c("AA", "TB", "AG", "DF"), animal3 = c("AA", "TB", "AG", "DF")), .Names = c("animal1", "animal2", "animal3"), row.names = c("snp1", "snp2", "snp3", "snp4"), class = "data.frame")
# > dfB
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      TB
# snp3      AG      AG      AG
# snp4      CA      DF      DF

When more than 50% of the values in a row mismatch, I assign "00" to all columns of that SNP; otherwise only the mismatching cells are set to "00":

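dfC <- do.call(rbind, lapply(rownames(dfA), function(x){
    mismatch <- dfA[x, ] != dfB[x, ]                # TRUE where the two data frames differ in this row
    mismatchfraction <- sum(mismatch) / ncol(dfA)   # share of mismatching columns in this row
    if(mismatchfraction > 0.5){
        dfA[x, ] <- "00"                            # more than 50% mismatches: blank the whole row
    } else {
        dfA[x, which(mismatch)] <- "00"             # otherwise blank only the mismatching cells
    }
    dfA[x, ]
    }))
data.frame(dfC)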

# > data.frame(dfC)
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      00
# snp3      AG      AG      AG
# snp4      00      00      00

Part of this can be done with the following code, but it only replaces the individual mismatches. I still need the last row (snp4) to be replaced entirely with "00":

as.data.frame(ifelse(as.matrix(dfA) == as.matrix(dfB), as.matrix(dfA), "00"))
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      00
# snp3      AG      AG      AG
# snp4      CA      00      00
Bas
  • @akrun the required output in both of my other questions is different from this one. The other question is basically a follow-up. – Bas Apr 13 '16 at 09:18
  • The dupe link was provided by somebody. I just clicked on it after an initial dupe vote was given by the other person. – akrun Apr 13 '16 at 09:19
  • This question is not a duplicate – Bas Apr 22 '16 at 10:14

1 Answer

This implements your 50% rule:

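dfA.m <- as.matrix(dfA)
dfB.m <- as.matrix(dfB)
i.arr <- which(dfA.m != dfB.m, arr.ind=TRUE)  # row/column indices of the mismatches (not used below)
mm <- (dfA.m != dfB.m)  # mismatches
mm[rowSums(mm) > ncol(dfA.m)/2, ] <- TRUE  # flag the whole row when more than 50% of it mismatches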
jogo
  • Works like a charm. I got my required output by adding this: `ifelse(mm, "00", dfA.m)` – Bas Apr 13 '16 at 09:22
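
For reference, the answer and the comment above can be combined into one self-contained sketch. It rebuilds the sample data from the question directly as character matrices (an assumption made here so the block runs on its own; with real data, `as.matrix(dfA)` and `as.matrix(dfB)` would be used as in the answer) and stores the result under the name `dfC`, as in the question:

```r
# Sample data from the question, rebuilt as character matrices
# (matrix() fills column by column, so each group of four is one animal)
snps    <- c("snp1", "snp2", "snp3", "snp4")
animals <- c("animal1", "animal2", "animal3")
dfA.m <- matrix(c("AA", "TT", "AG", "CA",
                  "AA", "TB", "AG", "CA",
                  "AA", "TT", "AG", "CA"),
                nrow = 4, dimnames = list(snps, animals))
dfB.m <- matrix(c("AA", "TT", "AG", "CA",
                  "AA", "TB", "AG", "DF",
                  "AA", "TB", "AG", "DF"),
                nrow = 4, dimnames = list(snps, animals))

mm <- dfA.m != dfB.m                        # TRUE where the genotypes differ
mm[rowSums(mm) > ncol(mm) / 2, ] <- TRUE    # 50% rule: flag every cell of such a row
dfC <- as.data.frame(ifelse(mm, "00", dfA.m), stringsAsFactors = FALSE)
dfC
#      animal1 animal2 animal3
# snp1      AA      AA      AA
# snp2      TT      TB      00
# snp3      AG      AG      AG
# snp4      00      00      00
```

Because everything operates on whole matrices, there is no per-row loop: `rowSums(mm)` counts the mismatches per SNP, and a single `ifelse` call does the replacement.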