2

I have one data frames as below, df1 and df2:

# data
df1 <- read.table(text = "
  SNP    CHR            BP A1      A2       zscore       P CEUmaf    LOC
rs58043752      1       3344877  A       G       0.289   0.7726  .  1:3344877
rs2483242       1       3345145  A       T       0.393   0.6946  .  1:3345145
rs1572039       1       3345216  T       C       0.443   0.658   .  1:3345216
rs1537407       1       3345705  T       C       -0.289  0.7726  .  1:3345705
rs2493277       1       3346348  C       G       -1.552  0.1207  0.09167  1:3346348
rs11583353      1       3346403  C       T       -0.414  0.6786  0.875  1:3346403",
                  header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "
  CHR         POS         ID     AA      DA          DAF               SDS              LOC
1       3344877 rs58043752      G       A       0.1095  0.80517243505521        1:3344877
1       3345145 rs2483242       T       A       0.5746  0.741513997303754       1:3345145
1       3345216 rs1572039       T       C       0.0784  0.130228249846394       1:3345216
1       3345705 rs1537407       C       T       0.798   0.275710355505832       1:3345705
1       3346348 rs2493277       G       C       0.5737  0.283452115383779       1:3346348
1       3346403 rs11583353      C       T       0.2238  -0.0246952604330743     1:3346403", 
                  header = TRUE, stringsAsFactors = FALSE)

I have a third dataframe (df3) like:

Input_SNP  SDS
1:3344877   NA 
1:3345145   NA   
1:3345216   NA  
1:3345705   NA
1:3346348   NA   
1:3346403   NA 

I want to compare A1 and A2 of df1 to AA and DA of df2 and then have an output to a third df3. My logic is as follows:

  1. If the df1$zscore in df1 is positive: I want to see if df1$A1 == df2$DA, if yes then I want to put df2$SDS into df3$SDS. If df1$A1 == df2$AA, then I want to put the NEGATIVE of df2$SDS into df3$SDS.
  2. If the df1$zscore in df1 is negative: I want to see if df1$A2 == df2$DA, if yes then I want to put the df2$SDS into df3$SDS If df1$A2 == df2$AA, then I want to put the NEGATIVE of df2$SDS into df3$SDS

The final output would thus look like:

    Input_SNP      SDS
    1:3344877   0.805
    1:3345145   0.742   
    1:3345216   -0.130  
    1:3345705   -0.276
    1:3346348   -0.283   
    1:3346403   -0.025
zx8754
  • 52,746
  • 12
  • 114
  • 209
Evan
  • 1,477
  • 1
  • 17
  • 34
  • Could you please post a data sample with `dput()`? Why don't you write a loop to compare df1 and df2 with your conditions? Am I missing something? – nevrome Nov 14 '16 at 12:15
  • 1
    Use merge on "SNP CHR BP" columns then use your logic to filter. – zx8754 Nov 14 '16 at 12:23
  • 1
    Possible duplicate of [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871) ? – zx8754 Nov 14 '16 at 12:24

3 Answers3

3

Here is another approach. First we determine where the zscore is negative. Then choose which column will be matched over to df2. Next identify which df2 column matched up. Line 4 is a safeguard for cases where there is no match between the two data frames. Finally we return SDS positive or negative based on the condition.

coll <- (df1$zscore < 0) + 1L
indx1 <- df1[c("A1","A2")][cbind(1:nrow(df1),coll)]
matches <- max.col((xx=indx1 == df2[c("DA","AA")]))
is.na(matches) <- rowSums(xx) == 0L
df3$SDS <- df2$SDS * ifelse(matches == 1,1,-1)
df3
#   Input_SNP         SDS
# 1 1:3344877  0.80517244
# 2 1:3345145  0.74151400
# 3 1:3345216 -0.13022825
# 4 1:3345705 -0.27571036
# 5 1:3346348 -0.28345212
# 6 1:3346403 -0.02469526
Pierre L
  • 28,203
  • 6
  • 47
  • 69
3

We can merge, then flip sign of SDS:

# merge
res <-  merge(df1[, 1:6], df2[, c(1:5, 7)],
              by.x = c("CHR", "BP", "SNP"),
              by.y = c("CHR", "POS", "ID"))

# make Input_SNP id column:
res$Input_SNP <- paste(res$CHR, res$BP, sep = ":")

# then flip effect based on sign and allele match
res$SDS <- ifelse(res$zscore > 0 & res$A1 == res$DA, res$SDS,
                  ifelse(res$zscore > 0 & res$A1 == res$AA, res$SDS * -1,
                         ifelse(res$zscore < 0 & res$A2 == res$DA, res$SDS, 
                                ifelse(res$zscore < 0 & res$A2 == res$AA, res$SDS * -1, NA))))

# subset required columns
res <- res[, c("Input_SNP", "SDS")]
res
#   Input_SNP         SDS
# 1 1:3344877  0.80517244
# 2 1:3345145  0.74151400
# 3 1:3345216 -0.13022825
# 4 1:3345705 -0.27571036
# 5 1:3346348 -0.28345212
# 6 1:3346403 -0.02469526
zx8754
  • 52,746
  • 12
  • 114
  • 209
1

I'm sure there are faster solutions, but a simple loop has the advantage to be quite speaking. And it can be adjusted easily.

df1 <- data.frame(
  BP = 1:6,
  A1 = c("A", "A", "T", "T", "C", "C"),
  A2= c("G", "T", "C", "C", "G", "T"),
  zscore = runif(6, min = -1, max = 1),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  CHR = 1:6,
  AA = c("G", "T", "T", "C", "G", "C"),
  DA = c("A", "A", "C", "T", "C", "T"),
  SDS = runif(6),
  stringsAsFactors = FALSE
)

df3 <- data.frame(SDS = rep(NA, nrow(df1)))

for (i in 1:nrow(df1)) {
  if (df1$zscore[i] >= 0) {
    if (df1$A1[i] == df2$DA[i]) {
      df3$SDS[i] <- df2$SDS[i]
    } else if (df1$A1[i] == df2$AA[i]) {
      df3$SDS[i] <- -df2$SDS[i]
    }
  } else if (df1$zscore[i] < 0) {
    if (df1$A2[i] == df2$DA[i]) {
      df3$SDS[i] <- df2$SDS[i]
    } else if (df1$A2[i] == df2$AA[i]) {
      df3$SDS[i] <- -df2$SDS[i]
    }
  }
}

df3
nevrome
  • 1,471
  • 1
  • 13
  • 28
  • 2
    The output is not correct. Recheck the logic and result from your loop – Pierre L Nov 14 '16 at 15:34
  • @PierreLafortune Can't reproduce the problem. It yields the same result as your solution for my example data. Please tell me, what's wrong. – nevrome Nov 14 '16 at 16:36
  • 1
    It is not the same output. You have `NA` values because your 'if' statements skip a condition. Think of a scenario where zscore is negative and A2 does not equal DA, your script will stop and return NA when it should move on to test A2 and AA. Your loop will only test A2 == AA **after** A2 == DA is TRUE which will never happen. – Pierre L Nov 14 '16 at 16:49
  • You're right. There's clearly a difference. I even intended to have the `NA` results for that case. But probably my interpretation of OPs description is wrong. I'll correct my solution. – nevrome Nov 14 '16 at 16:58
  • @PierreLafortune Done - did I still miss something? – nevrome Nov 14 '16 at 17:08