I have a data frame "comp". Sample for reference:

comp <- data.frame(A=c(1:5), B=c(1,0,1,0,0), C=c(5,2,0,0,NA), D=c(1,3,1,NA,0))

  A B  C D
1 1 1  5 1
2 2 0  2 3
3 3 1  0 1
4 4 0  0 NA
5 5 0 NA 0

I'd like to iterate a for loop over every column (excluding the first two). Basically the loop is supposed to print a particular string or NA depending on both the value in that cell and the value in column 2 of that row. The rules for what to print in C are:

  • If C is positive and B is 1: "Ysnp, Yphen"
  • If C is positive and B is 0: "Ysnp, Nphen"
  • If C is 0 and B is 1: "Nsnp, Yphen"
  • If C is 0 and B is 0: "Nsnp, Nphen"
  • If C is NA: NA

These same rules would also apply to column D (just replace C with D in the above rules). For my sample data it would look like this:

  A B C              D
1 1 1 "Ysnp, Yphen"  "Ysnp, Yphen"
2 2 0 "Ysnp, Nphen"  "Ysnp, Nphen"
3 3 1 "Nsnp, Yphen"  "Ysnp, Yphen"
4 4 0 "Nsnp, Nphen"  NA
5 5 0 NA             "Nsnp, Nphen"

My real data set has 50+ columns, so applying the for loop to each one is tedious. This is what I tried:

sapply(comp[,-(1:2)], function(snp) {
  for (i in 1:nrow(comp)){
    if (comp$snp[i]!=0 & !is.na(comp$snp[i])){
      if (comp[i, 2]==1) comp$snp[i] <- "Ysnp, Yphen"
      else comp$snp[i] <- "Ysnp, Nphen"
    }
    else if (comp$snp[i]==0 & !is.na(comp$snp[i])){
      if (comp[i, 2]==1) comp$snp[i] <- "Nsnp, Yphen"
      else comp$snp[i] <- "Nsnp, Nphen"
    }
    else comp$snp[i] <- NA
  }
})

However when I run this loop I get the following error:

Error in if (comp$snp[i] != 0 & !is.na(comp$snp[i])) { : 
  argument is of length zero

I've checked that my data frame does not contain any NULL values, so I'm not sure why the loop is generating this error. I also tried replacing comp$snp[i] with comp[i, snp] throughout the loop, or using apply instead of sapply, but that didn't solve the problem.

  • Please provide some minimal sample data and your expected output. Also, a `for` loop *inside* `sapply` seems very strange. – Maurits Evers Mar 15 '18 at 21:32
  • Added sample data/output. I agree it's strange but I'm not sure how else to generate my desired output. – Maya Gosztyla Mar 15 '18 at 21:46
  • every operation you do in your loop can be vectorized, drop the for loop and manipulate columns directly (using ifelse instead of if ... else) – moodymudskipper Mar 15 '18 at 21:51
  • This looks like a potential merge/match&replace; what are the rules for replacing entries in `C` with the strings? It seems `5 or 2 => "Ysnp, Yphen"`, `0 => "Ysnp, Nphen"`? Can you provide more details as to the logic? – Maurits Evers Mar 15 '18 at 21:51
  • I edited it to include more details on rules. Sorry the post is quite long now! – Maya Gosztyla Mar 15 '18 at 21:55
  • Your expected output is not consistent with your rules. For example, line 4, `C = 0` and `B = 0`. Why does `C` become `"Nsnp, Nphen"`? `C` is not negative! – Maurits Evers Mar 15 '18 at 22:01
  • And line 2, `C = 2` and `B = 0`. According to your rules, `C` should become `"Ysnp, Nphen"`; but you have `"Nsnp, Yphen"`. – Maurits Evers Mar 15 '18 at 22:03
  • My apologies, I've corrected it. – Maya Gosztyla Mar 15 '18 at 22:06
  • Nope, still not correct. Line 4: Both `C` and `B` are `0`, so according to rules should become `"Ysnp, Nphen"`; but output says `"Nsnp, Nphen"`. Actually, I just realised: You have two rules for `C=0` and `B=0`. – Maurits Evers Mar 15 '18 at 22:09
  • The function argument snp refers to the values of the column, not the names of the column. – Hugh Mar 15 '18 at 22:10
  • My thinking was that since sapply would apply the function over each column, the snp argument would be read as the column name. If this is not the case, what argument would I need to use instead? – Maya Gosztyla Mar 15 '18 at 22:14
  • Your revised `comp` sample `data.frame` has a misplaced `)` (right bracket). – Maurits Evers Mar 15 '18 at 22:44

1 Answer

This should be a simple matter for case_when:

comp <- data.frame(A=c(1:5), B=c(1,0,1,0,0), C=c(5,2,0,0,NA))

library(tidyverse)
comp %>%
    mutate(C = case_when(
        C > 0 & B == 1 ~ "Ysnp, Yphen",
        C > 0 & B == 0 ~ "Ysnp, Nphen",
        C == 0 & B == 1 ~ "Nsnp, Yphen",
        C == 0 & B == 0 ~ "Nsnp, Nphen",
        is.na(C) ~ NA_character_))  # a real NA, not the string "NA"
#  A B           C
#1 1 1 Ysnp, Yphen
#2 2 0 Ysnp, Nphen
#3 3 1 Nsnp, Yphen
#4 4 0 Nsnp, Nphen
#5 5 0        <NA>

Rules:

  • If C is positive and B is 1: "Ysnp, Yphen"
  • If C is positive and B is 0: "Ysnp, Nphen"
  • If C is 0 and B is 1: "Nsnp, Yphen"
  • If C is 0 and B is 0: "Nsnp, Nphen"
  • If C is NA: NA
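
If you would rather stay with case_when for many columns, one possible sketch is to wrap it in across(). This assumes dplyr 1.0.0 or later, which is newer than this thread, and that every column other than A and B should be recoded. The name `res` is only for illustration, and the final `TRUE ~ NA_character_` branch returns a real NA for anything the rules do not cover (including missing values).

```r
library(dplyr)

comp <- data.frame(A = 1:5, B = c(1, 0, 1, 0, 0),
                   C = c(5, 2, 0, 0, NA), D = c(1, 3, 1, NA, 0))

# Assumption: all columns except A and B follow the same rules.
# Inside the lambda, .x is the current column and B is still column B.
res <- comp %>%
    mutate(across(-c(A, B), ~ case_when(
        .x > 0 & B == 1 ~ "Ysnp, Yphen",
        .x > 0 & B == 0 ~ "Ysnp, Nphen",
        .x == 0 & B == 1 ~ "Nsnp, Yphen",
        .x == 0 & B == 0 ~ "Nsnp, Nphen",
        TRUE ~ NA_character_)))  # NA input (or anything else) becomes NA
res
```

Because the columns are selected by exclusion, the same call should work unchanged with 50+ columns, as long as A and B are the only ones to leave alone.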

Update

For an arbitrary number of columns, you could use a for loop. The for loop will be very fast because you're just replacing entries in an existing data.frame, and there is no dynamic memory (re-)allocation.

comp <- data.frame(A=c(1:5), B=c(1,0,1,0,0), C=c(5,2,0,0,NA), D=c(1,3,1,NA,0))


df <- comp
for (i in 3:ncol(df)) {
    # NA stays NA; otherwise paste the SNP label to the phenotype label from B
    df[, i] <- ifelse(is.na(df[, i]), NA_character_, paste(
        ifelse(df[, i] > 0, "Ysnp", "Nsnp"),
        ifelse(df$B == 1, "Yphen", "Nphen"), sep = ", "))
}
df
#  A B           C           D
#1 1 1 Ysnp, Yphen Ysnp, Yphen
#2 2 0 Ysnp, Nphen Ysnp, Nphen
#3 3 1 Nsnp, Yphen Ysnp, Yphen
#4 4 0 Nsnp, Nphen        <NA>
#5 5 0        <NA> Nsnp, Nphen

It turns out you don't even need a for loop but can use direct indexing.

df <- comp  # start again from the numeric data, not from an already-converted df
df[, 3:ncol(df)] <- ifelse(is.na(df[, 3:ncol(df)]), NA_character_, paste(
    ifelse(df[, 3:ncol(df)] > 0, "Ysnp", "Nsnp"),
    ifelse(df$B == 1, "Yphen", "Nphen"), sep = ", "))
df
#  A B           C           D
#1 1 1 Ysnp, Yphen Ysnp, Yphen
#2 2 0 Ysnp, Nphen Ysnp, Nphen
#3 3 1 Nsnp, Yphen Ysnp, Yphen
#4 4 0 Nsnp, Nphen        <NA>
#5 5 0        <NA> Nsnp, Nphen
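
As for the error in the question: as Hugh points out in the comments, the function argument `snp` holds the column's values, not its name. `comp$snp` therefore looks for a column literally called "snp", finds none and returns NULL, so the condition inside `if` has length zero. The sketch below shows this, and then the lapply equivalent of the loop above. The helper name `label_snp` and the copy `df2` are made up for illustration, and the helper returns a real NA for missing values.

```r
comp <- data.frame(A = 1:5, B = c(1, 0, 1, 0, 0),
                   C = c(5, 2, 0, 0, NA), D = c(1, 3, 1, NA, 0))

# There is no column called "snp", so `$` returns NULL ...
is.null(comp$snp)          # TRUE
# ... and comparing NULL with 0 gives a zero-length logical,
# which is what `if` complains about.
length(comp$snp[1] != 0)   # 0

# Hypothetical helper: `snp` is one column's values, `phen` is column B.
label_snp <- function(snp, phen) {
    ifelse(is.na(snp), NA_character_, paste(
        ifelse(snp > 0, "Ysnp", "Nsnp"),
        ifelse(phen == 1, "Yphen", "Nphen"), sep = ", "))
}

df2 <- comp
# lapply() passes each remaining column to the helper and returns a list
# of character vectors, which is assigned back column by column.
df2[-(1:2)] <- lapply(df2[-(1:2)], label_snp, phen = df2$B)
df2
```

Assigning the lapply result back into `df2[-(1:2)]` keeps the data.frame structure, whereas sapply would simplify the result to a character matrix.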
Maurits Evers
  • Sorry about the inconsistency, I have corrected it now. Thank you for the case_when suggestion, this seems very helpful! However I'm not sure how I would get this to iterate over all of the columns in my real data set, which has 50+ columns? Perhaps some combination of mutate and sapply? – Maya Gosztyla Mar 15 '18 at 22:11
  • @MayaGosztyla Your rules are still not consistent with output. See my comment above. You've got two rules for `C = B = 0`. – Maurits Evers Mar 15 '18 at 22:12
  • Jeez, this post is a mess, ha! I've fixed it. The rules are so strange that I'm even confusing myself... – Maya Gosztyla Mar 15 '18 at 22:18
  • @MayaGosztyla OK, my solution now reproduces your output. I'm not sure what you mean by *"iterate over all of the columns"*. So is your example data not representative? In that case, this is a bit of a moot effort, and I suggest providing a [representative & minimal example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Maurits Evers Mar 15 '18 at 22:23
  • The difference between the example data and my real data is that my real data has 50+ columns that all have similar information to column C, and I'd like to apply the same function to each one. I can add a column D to my example data if it would be helpful? Thanks for your patience, I'm new at this. EDIT: Added column D, hope this makes my example better. – Maya Gosztyla Mar 15 '18 at 22:28
  • @MayaGosztyla I'm still not sure on the generalised rules for `>2` columns. Can you elaborate? What are the rules for e.g. 5 columns, 6 columns and so on? Is this always about replacing values in the last column? Or multiple columns? – Maurits Evers Mar 15 '18 at 22:35
  • The rules for all other columns would be the same. Basically I would like to take the same function you described in your answer, and then apply it to columns C, D, E, F, etc. So for every column, it would look at an entry, compare that to the corresponding entry in column 2, and print the appropriate response before going to the next one. – Maya Gosztyla Mar 15 '18 at 22:39
  • @MayaGosztyla Ok, I think I understand. Please take a look at my updated solution. – Maurits Evers Mar 15 '18 at 23:00
  • Ah this is exactly what I needed! Thanks so much for your help/patience! I’ve been puzzling over this function for hours. – Maya Gosztyla Mar 15 '18 at 23:03
  • @MayaGosztyla You're very welcome. Glad it worked out:-) – Maurits Evers Mar 15 '18 at 23:05
  • @MayaGosztyla One last update: It turns out you don't even need a `for` loop, but can use direct indexing. Even more elegant. – Maurits Evers Mar 16 '18 at 00:26