I have two dataframes in R, one of which contains model outputs and the other contains model thresholds. That is, the outputs dataframe (call it df1) looks something like this:

model1      model2       model3
0.086       0.2645728    0.0001668753
0.024       0.2109496    0.0001905100
0.052       0.2484194    0.0038053175
0.274       0.3650003    0.0002842775
0.260       0.4055953    0.0280523161

And the threshold dataframe (call it df2) looks something like:

model   threshold
model1  0.5520000
model2  0.7924895
model3  0.7537394

I want to apply the >= operation to each entry in df1 where the column name is equal to the model name in df2, and store these binaries in a new dataframe (call it df3), which would be the same size as df1. That is, df3 is the predicted label for each entry in df1, given the corresponding model-based threshold in df2. It's clear that I could do this in a brute force for-loop fashion like:

df3 = df1
for (mdl in df2$model) {
   df3[, mdl] = df1[, mdl] >= df2$threshold[df2$model==mdl]
}

I don't like this solution, and I'm hoping there is a more idiomatic, vectorized way to perform this operation in R.

Reproducible Sample Data

df1 <- read.table(header = TRUE, text = "
                  model1      model2       model3
0.086       0.2645728    0.0001668753
0.024       0.2109496    0.0001905100
0.052       0.2484194    0.0038053175
0.274       0.3650003    0.0002842775
0.260       0.4055953    0.0280523161") 

df2 <- read.table(header = TRUE, text = "
                  model   threshold
model1  0.5520000
model2  0.7924895
model3  0.7537394")
Anoushiravan R
CopyOfA
  • Several comments: (1) Please share the data in a reproducible format: https://www.stackoverflow.com/questions/5963269 (2) Different models in df3 may have different lengths; how should that be handled? – Peace Wang Apr 29 '21 at 17:04
  • Thanks for pointing this out. I edited to correct for (1), but I'm not sure what you mean by (2). Every model has the same number of outputs, so every `>=` operation should result in a boolean value. – CopyOfA Apr 29 '21 at 17:23
  • No problem now with (2) after your explanation. – Peace Wang Apr 30 '21 at 00:15

2 Answers

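You can use map2_dfc from the purrr package, or mapply from base R, since we are iterating over two sequences in parallel: the columns of df1 and the thresholds in df2. Both approaches pair them by position, so the rows of df2 must be in the same order as the columns of df1: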

library(purrr)

df1 %>%
  map2_dfc(., df2$threshold, ~ .x >= .y)


# A tibble: 5 x 3
  model1 model2 model3
  <lgl>  <lgl>  <lgl> 
1 FALSE  FALSE  FALSE 
2 FALSE  FALSE  FALSE 
3 FALSE  FALSE  FALSE 
4 FALSE  FALSE  FALSE 
5 FALSE  FALSE  FALSE 

Or, in base R, using mapply from the apply family of functions (this returns a logical matrix rather than a data frame):

mapply(function(x, y) x >= y, df1, df2$threshold)

     model1 model2 model3
[1,]  FALSE  FALSE  FALSE
[2,]  FALSE  FALSE  FALSE
[3,]  FALSE  FALSE  FALSE
[4,]  FALSE  FALSE  FALSE
[5,]  FALSE  FALSE  FALSE
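
Both calls above pair columns and thresholds by position. As a minimal sketch, assuming the rows of df2 might not follow the column order of df1, the thresholds can first be reordered by model name with match and then compared column-wise with sweep. The names thr and df3 are only illustrative:

```r
# Sample data from the question
df1 <- read.table(header = TRUE, text = "
model1 model2 model3
0.086 0.2645728 0.0001668753
0.024 0.2109496 0.0001905100
0.052 0.2484194 0.0038053175
0.274 0.3650003 0.0002842775
0.260 0.4055953 0.0280523161")

df2 <- read.table(header = TRUE, text = "
model threshold
model1 0.5520000
model2 0.7924895
model3 0.7537394")

# Reorder the thresholds so they follow the column order of df1
thr <- df2$threshold[match(names(df1), df2$model)]

# Fail early if some column of df1 has no threshold in df2
stopifnot(!anyNA(thr))

# Compare every column (MARGIN = 2) against its own threshold
df3 <- as.data.frame(sweep(as.matrix(df1), 2, thr, FUN = ">="))
df3
```

With the sample data every value is below its threshold, so df3 contains only FALSE, matching the outputs shown above.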
Anoushiravan R

An alternative, more naive method:

Transform df2 into a data frame with the same dimensions as df1, then compare the two element-wise:

df <- data.frame(
    model1 = rep(df2[which(df2$model == "model1"), "threshold"],nrow(df1)),
    model2 = rep(df2[which(df2$model == "model2"), "threshold"],nrow(df1)),
    model3 = rep(df2[which(df2$model == "model3"), "threshold"],nrow(df1))
)

df3 <- df1 >= df
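
The same idea can be written without typing out each model name. This is only a sketch under the assumption that every column of df1 has a matching row in df2; the names thr and thr_mat are illustrative:

```r
# Sample data from the question
df1 <- read.table(header = TRUE, text = "
model1 model2 model3
0.086 0.2645728 0.0001668753
0.024 0.2109496 0.0001905100
0.052 0.2484194 0.0038053175
0.274 0.3650003 0.0002842775
0.260 0.4055953 0.0280523161")

df2 <- read.table(header = TRUE, text = "
model threshold
model1 0.5520000
model2 0.7924895
model3 0.7537394")

# Named threshold vector, reordered to follow the columns of df1
thr <- setNames(df2$threshold, df2$model)[names(df1)]

# Repeat each threshold down its column so the matrix has the shape of df1
thr_mat <- matrix(thr, nrow = nrow(df1), ncol = ncol(df1), byrow = TRUE)

# Element-wise comparison; the result is a logical matrix, converted back to a data frame
df3 <- as.data.frame(df1 >= thr_mat)
df3
```

Looking thresholds up by name rather than by position keeps the result correct if df2 is later sorted differently.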
Peace Wang