I want to create a label column with 4 possible labels. However, each labelling step I run overwrites some of the labels assigned by an earlier step. For example, I am looking to create a label column like this:

Gene   Feature1   Feature2   Feature3 ...  label
Gene1   1            3         1            most likely
Gene2   0            0         1            probable
Gene3   NA           NA        NA           unknown
Gene4   0            0         0            unlikely

However, my data is very big so my features are not representative here, but the 4 labels are what I'm trying to get. I try to code this with:

df$label[(df$Mechanism == 1)|(df$med >= 3) |(df$OMIM == 1)] <- "most likely"

df$label[is.na(df$label) & (df$med <= 2 )|(df$SideeffectFreq>=1) |(df$MGI_Gene==1) |(df$model_Gene==1) |(df$Rank>=1) ] <- "probable"

df$label[(df$Causality == 'least likely')] <- "least likely"

df$label[is.na(df$label)] <- "unknown"

When I run the first line to create the "most likely" label, it labels 50 genes (which is what I expected), but running the second line for "probable" re-labels some of the "most likely" genes, leaving only 34 of them. I thought using is.na(df$label) or (df$label != 'most likely') would resolve this, but neither does.

Is there a better way to go about creating a label column like this? I am new to coding, so it would also be really helpful if anyone could explain why is.na(df$label) and (df$label != 'most likely') do not work as I expected.

Edit: Example where the 'most likely' label gets overwritten:

#Input data:
dput(dt)
structure(list(Gene = c("gene1", "gene2", "gene3", "gene4"), 
    F1 = c(1L, 0L, 0L, 1L), F2 = c(3L, 0L, 0L, 1L), F3 = c("1", 
    "1", "1", "least likely"), label = c(NA, NA, NA, 
    NA)), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

dt$label[(dt$F1 == 1)|(dt$F2 >= 3) |(dt$F1 == 1)] <- "most likely"
dt$label[(dt$label != 'most likely') & (dt$F1 == 2)|(dt$F2 == 0) |(dt$F1 == 1)] <- "probable"
dt$label[(dt$F1 == 0)|(dt$F2 == 0)] <- "unlikely"
dt$label[(dt$F3 == 'least likely')] <- "unknown"
desertnaut
DN1
  • Take a look at `which()`, `dplyr::case_when()` or `switch()`; they all help to assign values based on conditions. Can I also suggest you include a reproducible example? It's much easier to offer specific advice: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Jay Achar Sep 21 '20 at 10:05
  • Thank you for this, I've had a go at including an example where my problem is. I can't unfortunately use my actual data. – DN1 Sep 21 '20 at 10:14

1 Answer

You can use case_when or nested ifelse statements. Both test the conditions in the order they are written, so each row gets the label of the first condition it satisfies and later conditions cannot overwrite it.

library(dplyr)

dt %>%
  mutate(label = case_when(
    # conditions are tested in order; the first TRUE one sets the label
    Mechanism == 1 | med >= 3 | OMIM == 1 ~ 'most likely',
    med <= 2 | SideeffectFreq >= 1 | MGI_Gene == 1 | model_Gene == 1 | Rank >= 1 ~ 'probable',
    # add more conditions
    # if none of the conditions above is satisfied, assign "unknown"
    TRUE ~ 'unknown'))
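
The nested ifelse alternative mentioned above might look roughly like the sketch below, which uses only base R and the small example data from the question. The labelling rules on F1, F2 and F3 are illustrative assumptions, not the asker's real rules.

```r
# Example data from the question, rebuilt as a plain data.frame
dt <- data.frame(
  Gene = c("gene1", "gene2", "gene3", "gene4"),
  F1 = c(1L, 0L, 0L, 1L),
  F2 = c(3L, 0L, 0L, 1L),
  F3 = c("1", "1", "1", "least likely"),
  stringsAsFactors = FALSE
)

# Conditions are checked from the outermost ifelse inwards;
# the first TRUE condition decides the label for that row.
# These rules are assumed for illustration only.
dt$label <- ifelse(dt$F1 == 1 | dt$F2 >= 3, "most likely",
            ifelse(dt$F3 == "least likely", "least likely",
            ifelse(dt$F1 == 0 & dt$F2 == 0, "unlikely",
                   "unknown")))

print(dt)
```

Here gene4 satisfies both the first and the second rule, but it keeps "most likely" because the first matching rule wins.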

If you have data.table, it has fcase, which is similar to case_when:

library(data.table)
dt[, label := fcase( Mechanism == 1 | med >= 3 | OMIM == 1 , 'most likely', 
                     #more conditions
                     default = 'unknown')]
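
As for why the is.na(df$label) guard in the question did not protect existing labels: in R, & binds more tightly than |, so the guard only applies to the first condition and not to the whole OR chain. A related point is that label != 'most likely' evaluates to NA for rows whose label is still NA. The sketch below illustrates the precedence issue with vectors taken from the question's example; the variable names unguarded and guarded are made up for illustration.

```r
# State after the first labelling step in the question's example
label <- c("most likely", NA, NA, "most likely")
F1 <- c(1L, 0L, 0L, 1L)
F2 <- c(3L, 0L, 0L, 1L)

# As written in the question: & is evaluated before |, so this is parsed as
# (is.na(label) & F1 == 2) | F2 == 0 | F1 == 1
unguarded <- is.na(label) & F1 == 2 | F2 == 0 | F1 == 1

# With parentheses, the is.na() guard applies to the whole OR group
guarded <- is.na(label) & (F1 == 2 | F2 == 0 | F1 == 1)

print(unguarded)  # rows 1 and 4 are selected even though they are already labelled
print(guarded)    # only the still-unlabelled rows 2 and 3 are selected
```

Wrapping the OR chain in parentheses fixes the base R approach, although case_when or fcase avoids the problem altogether because later conditions never overwrite earlier matches.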
Ronak Shah
  • Thank you these look like they will work. I'm just having errors in both cases: for dplyr it's ```Error: Problem with `mutate()` input `label`. x Input `label` must be a vector, not a `formula` object.``` and for data.table it's ```Error in fcase(Mechanism == 1 | BPmed >= 3 | OMIM == 1, "most likely", : could not find function "fcase"``` - is this a case where I need to update my library? – DN1 Sep 21 '20 at 10:26
  • `fcase` is a recent function in `data.table`. My `packageVersion('data.table')` is 1.13.0, so you may need to update the library. For `dplyr` `case_when`, I have updated the answer and it should work now. – Ronak Shah Sep 21 '20 at 10:31
  • Thank you, I think we have different R versions for both cases as the dplyr code now gives the error ```Error in .shallow(x, cols = cols, retain.key = TRUE) : can't set ALTREP truelength``` - I'll see if I can update my packages. I do have an older version of data.table and my dplyr version is 1.0.1 – DN1 Sep 21 '20 at 10:52
  • All done now (had to update my Rtools to properly update the packages) and it worked perfectly, for both of them, thanks for your help – DN1 Sep 21 '20 at 15:58