I'm having a problem where I want to create a column of 4 labels. However, when I try to create these, the labels I make eat into re-labeling the first label I have assigned. For example I am looking to create a label
column like this:
Gene Feature1 Feature2 Feature3 ... label
Gene1 1 3 1 most likely
Gene2 0 0 1 probable
Gene3 NA NA NA unknown
Gene4 0 0 0 unlikely
However, my data is very big so my features are not representative here, but the 4 labels are what I'm trying to get. I try to code this with:
df$label[(df$Mechanism == 1)|(df$med >= 3) |(df$OMIM == 1)] <- "most likely"
df$label[is.na(df$label) & (df$med <= 2 )|(df$SideeffectFreq>=1) |(df$MGI_Gene==1) |(df$model_Gene==1) |(df$Rank>=1) ] <- "probable"
df$label[(df$Causality == 'least likely')] <- "least likely"
df$label[is.na(df$label)] <- "unknown"
When I run the first line to create the "most likely" label, this labels 50 genes (which is what I expected), but running the second line for "probable" re-labels some of the "most likely" genes to only give 34 of them left. I thought using is.na(df$label)
or (df$label != 'most likely')
would resolve this, but neither do.
Is there a better way to go about creating a labels column like this? I am new to coding so also if anyone can explain why the is.na(df$label)
or (df$label != 'most likely')
do not work as I expected that would also be really helpful.
Edit: Example where 'most likely' label is taken up:
#Input data:
dput(dt)
structure(list(Gene = c("gene1", "gene2", "gene3", "gene4"),
F1 = c(1L, 0L, 0L, 1L), F2 = c(3L, 0L, 0L, 1L), F3 = c("1",
"1", "1", "least likely"), label = c(NA, NA, NA,
NA)), row.names = c(NA, -4L), class = c("data.table",
"data.frame"))
dt$label[(dt$F1 == 1)|(dt$F2 >= 3) |(dt$F1 == 1)] <- "most likely"
dt$label[(dt$label != 'most likely') & (dt$F1 == 2)|(dt$F2 == 0) |(dt$F1 == 1)] <- "probable"
dt$label[(dt$F1 == 0)|(dt$F2 == 0)] <- "unlikely"
dt$label[(dt$F3 == 'least likely')] <- "unknown"