-1

I have a big set of data that looks like the following:

Name    SNP.x   ILMN.Strand.x   Customer.Strand.x   SNP.y   ILMN.Strand.y   Customer.Strand.y   
exm-rs10128711  [T/C]   BOT BOT [T/C]   BOT BOT
exm-rs10134944  [A/G]   TOP BOT NA  NA  NA  
exm-rs10218696    NA    NA  NA [T/C] BOT TOP
exm-rs10223421  [A/C]   TOP BOT NA  NA  NA

How do I create new columns "SNP","ILMN.Strand","Customer.Strand", whereby:

  1. if (SNP.x = SNP.y), then "SNP","ILMN.Strand","Customer.Strand" would be from "SNP.x","ILMN.Strand.x","Customer.Strand.x"
  2. if (SNP.x is not equal to SNP.y), and SNP.x is NA (missing value), then the values in the new columns should be taken from "SNP.y","ILMN.Strand.y","Customer.Strand.y"

  3. if (SNP.x is not equal to SNP.y), and SNP.y is NA (missing value), then the values in the new columns should be taken from "SNP.x","ILMN.Strand.x","Customer.Strand.x"

Many thanks in advance! :)

Miyii
  • 141
  • 1
  • 1
  • 8

2 Answers2

0

I am assuming, that if both SNP.x and SNP.y are NA, the row is dropped from the dataframe. If SNP.x != SNP.y the row is also dropped (if that case were to occur).

The code below is not pretty or very efficient, but it ought to do the trick.

tmp <- apply(df, 1, function(x){
  # if SNP.x == SNP.y and not NA pass X
  if(!is.na(x["SNP.x"] == x["SNP.y"])) {
    if(x["SNP.x"] == x["SNP.y"]) data.frame(Name = x["Name"], SNP = x["SNP.x"],  ILMN.Strand = x["ILMN.Strand.x"], Customer.Strand = x["Customer.Strand.x"])
  } else if(is.na(x["SNP.x"])) { # else if SNP.x is NA pass y
    if(!is.na(x["SNP.y"])) data.frame(Name = x["Name"], SNP = x["SNP.y"],  ILMN.Strand = x["ILMN.Strand.y"], Customer.Strand = x["Customer.Strand.y"])
  } else if(is.na(x["SNP.y"])) { # else if SNP.y is NA pass x
    if(!is.na(x["SNP.x"])) data.frame(Name = x["Name"], SNP = x["SNP.x"],  ILMN.Strand = x["ILMN.Strand.x"], Customer.Strand = x["Customer.Strand.x"])
  } else NULL # otherwise pass NULL (e.g. (SNP.x != SNP.y AND neither are NA))
})

# rbind the list-output of the previous apply() function
result <- do.call(rbind, tmp[!sapply(tmp, is.null)])

The result is a dataframe with the following structure:

str(result)

'data.frame':   81 obs. of  4 variables:
 $ Name           : Factor w/ 81 levels "exm-rs666","exm-rs3510",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ SNP            : Factor w/ 2 levels "[A/C]","[T/G]": 1 2 1 1 1 2 2 2 2 2 ...
 $ ILMN.Strand    : Factor w/ 2 levels "TOP","BOT": 1 1 1 2 1 1 2 2 1 1 ...
 $ Customer.Strand: Factor w/ 2 levels "TOP","BOT": 1 1 1 2 1 2 1 1 1 1 ...

EDIT:

This might be the better solution (R 3.2.4 with dplyr 0.5.0), since apply() coerces the dataframe into a matrix etc. The solution below also returns a unique '..', if (SNP.X != SNP.Y) and both are NOT NA. Hope this will do the trick, although without more information on your data it is hard to anticipate which problems you may run into. In this solution factors are coerced into characters, so keep that in mind.

# This is a helper function for the logic
# a and b will be tested; retA, retB, NA or '..' (see below) will be returned
logicalTest <- function(a, b, retA, retB){ 

  # coerce factors into character
  if(is.factor(retA)) retA <- as.character(retA)
  if(is.factor(retB)) retB <- as.character(retB)

  tmp <- a == b                        # compare a and b (surrogates for SNP.x and SNP.y) and put in tmp variable

  if(is.na(tmp)){                      # if the comparison was NA one of a or b must have been NA ...
    if(is.na(a) & is.na(b)) return(NA)  # if both were NA just return NA,
    else if(is.na(a)) return(retB)      # if a was NA return b,
    else return(retA)                   # otherwise return a
  } else if(tmp){                      # if tmp is TRUE (a == b)
    return(retA)                        # return a
  } else return("..")                  # else (a != b) return ".."
}

# load dplyr for the bit below
library(dplyr)

result <- df %>% 
  group_by(Name) %>% 
  transmute(SNP = logicalTest(SNP.x, SNP.y, SNP.x, SNP.y),
            ILMN.Strand = logicalTest(SNP.x, SNP.y, ILMN.Strand.x, ILMN.Strand.y),
            Customer.Strand = logicalTest(SNP.x, SNP.y, Customer.Strand.x, Customer.Strand.y))

# get cleaned results
result[!rowSums(is.na(result)),] # drop rows with NAs
result[!(rowSums(is.na(result)) | result$SNP == ".."),] # drop rows with NAs and ".."
DrEigelb
  • 588
  • 4
  • 8
  • Hi impz, I tried running your code, but I received an error msg instead.. "Error in apply(df, 1, function(x) { : dim(X) must have a positive length" I'm a noob at this. Could you please advise again? Thanks! – Miyii Jul 13 '16 at 02:34
  • this is probably due to a coercion problem. i included another solution, that should work better – DrEigelb Jul 13 '16 at 08:44
0

What about the data.table?

I ma not sure how the logic should work (cases like SNP.x != SNP.y or both NA's, but you can amend it yourself.

Edit: Few approaches benchmarked.

Prepare data:

require(data.table)
require(microbenchmark)

dat1 <- data.table(Name = c("exm-rs10128711", "exm-rs10134944", "exm-rs10218696", "exm-rs10223421", "both_NAs", "no_NAs_just_diff"),
                   SNP.x = c("[T/C]", "[A/G]", NA, "[A/C]", NA, "new_x"),
                   ILMN.Strand.x = c("BOT", "TOP", NA, "TOP", "new_x", "new_x"),
                   Customer.Strand.x = c("BOT", "BOT", NA, "BOT", "new_x", "new_x"),
                   SNP.y = c("[T/C]", NA, "[T/C]", NA, NA, "new_y"),
                   ILMN.Strand.y = c("BOT", NA, "BOT", NA, "new_y", "new_y"),
                   Customer.Strand.y = c("BOT", NA, "TOP", NA, "new_y", "new_y"))

# Make it a bit bigger
for (i in seq_len(15)) dat1 <- rbind(dat1, dat1)  # 15 MB, 196608 rows

# If needed cast to characters (to get rid of "level sets of factors are different" error...)
# dat <- dat[, lapply(.SD, as.character)]

Functions:

# if else returning a list
f1 <- function() {
  dat1[, c("SNP", "ILMN.Strand", "Customer.Strand") :=
         if        ( !is.na(SNP.x) ) { list(SNP.x, ILMN.Strand.x, Customer.Strand.x)
         } else if ( !is.na(SNP.y) ) { list(SNP.y, ILMN.Strand.y, Customer.Strand.y)
         } else                      { list(NA_character_, NA_character_, NA_character_) },
       by = seq_len(nrow(dat1))
       ][]
}

# ifelse per column
f2 <- function() {
  dat1[, ":="(SNP = ifelse(!is.na(SNP.x), SNP.x,
                           ifelse(!is.na(SNP.y), SNP.y, NA_character_)),
              ILMN.Strand = ifelse(!is.na(SNP.x), ILMN.Strand.x,
                                   ifelse(!is.na(SNP.y), ILMN.Strand.y, NA_character_)),
              Customer.Strand = ifelse(!is.na(SNP.x), Customer.Strand.x,
                                       ifelse(!is.na(SNP.y), Customer.Strand.y, NA_character_)))
       ][]
}

# ifelse returning a list
f3 <- function() {
  dat1[, c("SNP", "ILMN.Strand", "Customer.Strand") :=
         ifelse (!is.na(SNP.x), list(list(SNP.x, ILMN.Strand.x, Customer.Strand.x)),
                 ifelse (!is.na(SNP.y), list(list(SNP.y, ILMN.Strand.y, Customer.Strand.y)),
                                        list(list(NA_character_, NA_character_, NA_character_))))[[1]]  # HERE IS THE ONE!
       ][]
}

Benchmarking

microbenchmark(
  d1 <- f1(),
  d2 <- f2(),
  d3 <- f3(),
  times = 5)

# Unit: milliseconds
#        expr       min        lq     mean    median       uq      max neval cld
#  d1 <- f1() 303.03681 316.91054 354.9147 330.91177 403.3858 420.3286     5  b 
#  d2 <- f2() 658.27527 660.19131 723.9005 664.31352 737.0994 899.6230     5   c
#  d3 <- f3()  78.20754  84.91487 110.3533  86.73539 104.9149 196.9938     5 a  

d1[1:6, ]
#                Name SNP.x ILMN.Strand.x Customer.Strand.x SNP.y ILMN.Strand.y Customer.Strand.y   SNP ILMN.Strand Customer.Strand
# 1:   exm-rs10128711 [T/C]           BOT               BOT [T/C]           BOT               BOT [T/C]         BOT             BOT
# 2:   exm-rs10134944 [A/G]           TOP               BOT    NA            NA                NA [A/G]         TOP             BOT
# 3:   exm-rs10218696    NA            NA                NA [T/C]           BOT               TOP [T/C]         BOT             TOP
# 4:   exm-rs10223421 [A/C]           TOP               BOT    NA            NA                NA [A/C]         TOP             BOT
# 5:         both_NAs    NA         new_x             new_x    NA         new_y             new_y    NA          NA              NA
# 6: no_NAs_just_diff new_x         new_x             new_x new_y         new_y             new_y new_x       new_x           new_x

sapply(list(d1, d2, d3), FUN = identical, d1)
# [1] TRUE TRUE TRUE

Comments

f2 is here only because I could not figure out how to return a list from the ifelse, just by luck I got this double list idea used in f3.

To read about multiple assigments in data.table refer to e.g. Assign multiple columns using := in data.table, by group

Smaller sets

96 rows:

# Unit: microseconds
#        expr      min       lq     mean   median       uq      max neval cld
#  d1 <- f1() 1964.988 1968.936 2238.697 2273.276 2404.722 2581.564     5   b
#  d2 <- f2()  976.574  998.284 1147.020 1033.021 1038.942 1688.280     5  a 
#  d3 <- f3()  684.471  845.916 1026.389 1141.573 1209.466 1250.519     5  a 

6144 rows:

# Unit: milliseconds
#        expr       min        lq      mean   median        uq       max neval cld
#  d1 <- f1() 11.977032 12.128610 13.869310 12.52532 12.585317 20.130271     5  b 
#  d2 <- f2() 17.200552 17.627260 21.616209 20.76224 22.830254 29.660738     5   c
#  d3 <- f3()  2.945114  3.009456  3.317191  3.04064  3.071429  4.519314     5 a  
Community
  • 1
  • 1
m-dz
  • 2,342
  • 17
  • 29
  • Hi m-dz, unfortunately, I was unable to load or install microbenchmark library as it was not found. Am I missing something here? – Miyii Jul 13 '16 at 02:39
  • Any errors? It is on CRAN, https://cran.r-project.org/web/packages/microbenchmark/index.html. But it is completely irrelevant to the question; actually everything you will get from it is pasted above (the timings). Is it what you asked for? – m-dz Jul 13 '16 at 08:50