Note: After a lot of experimenting with the code, I have completely rewritten this question.
I'm trying to use user-input values in a 1-row data object to predict the user's category with randomForest; however, I get an error indicating NA/Inf values in my data object.
I have a randomForest classifier, which I've trained on a training dataset and validated on a validation dataset. This was done in my file analysis.R on GitHub, and the object is saved as rf.rds (which is read in by server.R).
In server.R I read in the training data, which is called x (i.e. x.rds), and then extract only the first row into userdf.
In ui.R I let users enter values which reactively update this object:
values <- reactiveValues()
values$df <- userdf
newEntry <- observe({
  values$df$bron_badges <- input$bron_badges
  values$df$silv_badges <- input$silv_badges
  values$df$gold_badges <- input$gold_badges
  values$df$reputation  <- input$reputation
  values$df$views       <- input$views
  values$df$votes       <- input$votes
})
This appears to work. I say so because I can run:
output$table <- renderTable({data.frame(values$df)})
and watch the values update beautifully in my UI.
However, when I try to run a prediction for the user with the following code, I get an error message saying that there are NAs:
output$results <- renderText({
  ds1 <- values$df
  # Sort both objects by column name so the columns line up
  x   <- x[, sort(names(x))]
  ds1 <- ds1[, sort(names(ds1))]
  names(ds1) <- colnames(x)
  predict(rf, newdata = data.frame(ds1))
})
This happens even though I "know" the data is not NA, both from having watched values$df update via ui.R in the line mentioned above and because all of the initial values, which come from x, are not NA. I've also tried it without the data.frame part of the predict statement.
Interestingly, if I replace the predict statement above with table(is.na(ds1)), it tells me that all 1,033 values are NA.
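To narrow down where the NAs come from, a check along these lines can be dropped in place of the predict statement. This is only a minimal sketch: the small data frame below is a hypothetical stand-in for values$df, not my real data, and it uses base R only.

```r
# Hypothetical stand-in for values$df: one row, a few of the columns named above
ds1 <- data.frame(bron_badges = 3, silv_badges = NA_real_, reputation = 120)

# First class of every column, to spot columns whose type is not what the model expects
col_classes <- vapply(ds1, function(col) class(col)[1], character(1))

# Names of the columns that contain at least one NA
na_cols <- names(ds1)[colSums(is.na(ds1)) > 0]

print(col_classes)
print(na_cols)
```

Running the same two checks on userdf and on the reactive copy should show whether the two objects differ in column types or only in values.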
Also interesting: if I replace ds1 with userdf in the predict statement, then everything runs fine (userdf is the non-reactive object).
If I replace the predict statement with setdiff(colnames(x), colnames(ds1)), it does not show any mismatching column names. (It did until I added the colnames statements above, due to some weird conversion of _ to . in the reactive data frame's column names.)
Finally, I find that if I take the names from rf via rf$forest$ncat instead, I get "incorrect number of dimensions" as my error:
output$results <- renderTable({
  ds1 <- values$df
  cn  <- rf$forest$ncat
  cn  <- cn[, sort(names(cn))]
  ds1 <- ds1[, sort(names(ds1))]
  names(ds1) <- names(cn) # x # rf$forest$xlevels
  predict(rf, newdata = data.frame(ds1))
})
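As far as I can tell, "incorrect number of dimensions" is the error R raises when two-index subsetting ([, ...]) is applied to a plain vector, which would fit if rf$forest$ncat is a named vector rather than a data frame. The sketch below reproduces this in base R; the ncat vector here is a hypothetical stand-in, not the real contents of my model.

```r
# Hypothetical stand-in for rf$forest$ncat: a named vector, one entry per predictor
ncat <- c(votes = 1L, views = 1L, reputation = 1L)

# Two-index subsetting on a vector fails; capture the message instead of stopping
msg <- tryCatch(
  ncat[, sort(names(ncat))],
  error = function(e) conditionMessage(e)
)
print(msg)

# Single-index subsetting reorders the vector by name without error
ncat_sorted <- ncat[sort(names(ncat))]
print(names(ncat_sorted))
```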
However, with the following modification:
output$results <- renderTable({
  ds1 <- values$df
  # Transpose the named vector into a one-row data frame so [, ...] works
  cn  <- as.data.frame(t(rf$forest$ncat))
  cn  <- cn[, sort(names(cn))]
  ds1 <- ds1[, sort(names(ds1))]
  names(ds1) <- names(cn) # x # rf$forest$xlevels
  predict(rf, newdata = data.frame(ds1))
})
My error goes back to "variables in the training data missing in newdata".
Minimal, reproducible example: https://github.com/hack-r/troubleshooting_predictor_minimal
Here's the full reproducible code and data: https://github.com/hack-r/coursera_shiny
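For the "variables in the training data missing in newdata" error, the setdiff check I used only looks in one direction (names in x that are absent from ds1). A two-way comparison might be more telling. The sketch below uses hypothetical name vectors in place of the model's predictor names and colnames(ds1), with one name deliberately altered to mimic the _ to . conversion.

```r
# Hypothetical name vectors standing in for the model's predictors and colnames(ds1)
trained_names <- c("bron_badges", "silv_badges", "gold_badges", "reputation")
newdata_names <- c("bron.badges", "silv_badges", "gold_badges", "reputation")

# Predictors the model expects that the new data does not provide
missing_in_newdata <- setdiff(trained_names, newdata_names)

# Columns in the new data that the model does not know about
extra_in_newdata <- setdiff(newdata_names, trained_names)

print(missing_in_newdata)
print(extra_in_newdata)
```

If both results are empty for the real objects, the names themselves are probably not the problem and the column types or values are the next thing to compare.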