
This has been driving me crazy. I've been looking through similar posts all day but can't seem to solve my problem. I have a naive Bayes model trained and stored as `model`. I'm attempting to predict with a `newdata` data frame, but I keep getting the error `Error: $ operator is invalid for atomic vectors`. Here is what I am running: `stats::predict(model, newdata = newdata)`, where `newdata` is the first row of another data frame: `newdata <- pbp[1, c("balls", "strikes", "outs_when_up", "stand", "pitcher", "p_throws", "inning")]`

`class(newdata)` gives `[1] "tbl_df" "tbl" "data.frame"`.

StupidWolf
Kyle Dixon
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Apr 16 '21 at 19:40
  • I know sorry, I was trying to think how I could give you the model but I'm not sure how. I could give you the training code but it takes like 6 hours to train. Would giving you an actual row of values for `newdata` help? – Kyle Dixon Apr 16 '21 at 19:46
  • Maybe I can try giving you a link to the .Rdata files? Try this Google link, the model is too big for Github. https://drive.google.com/drive/folders/1HppF0msWias4sqxYmlTXO6EAz1bJKCZy?usp=sharing – Kyle Dixon Apr 16 '21 at 19:49
  • That's not really that helpful. Maybe create a simple example using a built-in data set that shows the code you used to fit the model and make the prediction. See if you can get the same error message. It's not even clear what type of object `model` is at the moment. – MrFlick Apr 16 '21 at 19:50
  • This is essentially how I fit the data but I'm struggling to reproduce the same error. `model <- caret::train(iris[, 1:4], iris$Species, method = "nb", preProc = c("center", "scale"))` `newdata <- as_tibble(newdata)` `stats::predict(model, newdata = newdata[1, c("Sepal.Width", "Sepal.Length", "Petal.Length", "Petal.Width")])` I added the line to force the tibble because when I query the newdata that I use, it comes back as a tibble already. – Kyle Dixon Apr 16 '21 at 20:13

1 Answer


The issue is with the new data: its columns should have the same types and factor levels as the data used in training. For example, if we predict on one of the rows from `model$trainingData`, it works:

predict(model, head(model$trainingData, 1))
#[1] Curveball
#Levels: Changeup Curveball Fastball Sinker Slider

Comparing the `str` of both datasets shows that some columns that are factors in the training data are character or integer in `newdata`:

str(model$trainingData)
'data.frame':   1277525 obs. of  7 variables:
 $ pitcher     : Factor w/ 1390 levels "112526","115629",..: 277 277 277 277 277 277 277 277 277 277 ...
 $ stand       : Factor w/ 2 levels "L","R": 1 1 2 2 2 2 2 1 1 1 ...
 $ p_throws    : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
 $ balls       : num  0 1 0 1 2 2 2 0 0 0 ...
 $ strikes     : num  0 0 0 0 0 1 2 0 1 2 ...
 $ outs_when_up: num  1 1 1 1 1 1 1 2 2 2 ...
 $ .outcome    : Factor w/ 5 levels "Changeup","Curveball",..: 3 4 1 4 1 5 5 1 1 5 ...

str(newdata)
tibble [1 × 6] (S3: tbl_df/tbl/data.frame)
 $ balls       : int 3
 $ strikes     : int 2
 $ outs_when_up: int 1
 $ stand       : chr "R"
 $ pitcher     : int 605200
 $ p_throws    : chr "R"
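
Instead of reading two `str` dumps, the class of each shared column can be tabulated side by side. The following is only a minimal base R sketch: `compare_classes` is a hypothetical helper name, and the two toy data frames (with made-up values) merely stand in for `model$trainingData` and `newdata`.

```r
# Hypothetical helper: report the first class of each column shared by
# two data frames and flag the columns whose classes differ.
compare_classes <- function(train, new) {
  shared <- intersect(names(train), names(new))
  first_class <- function(df) {
    unname(vapply(df[shared], function(col) class(col)[1], character(1)))
  }
  out <- data.frame(
    column = shared,
    train  = first_class(train),
    new    = first_class(new),
    stringsAsFactors = FALSE
  )
  out$mismatch <- out$train != out$new
  out
}

# Toy stand-ins for model$trainingData and newdata (made-up values)
train_toy <- data.frame(
  stand   = factor(c("L", "R")),
  pitcher = factor(c("112526", "605200")),
  balls   = c(0, 1)
)
new_toy <- data.frame(
  stand   = "R",
  pitcher = 605200L,
  balls   = 3,
  stringsAsFactors = FALSE
)

cmp <- compare_classes(train_toy, new_toy)
print(cmp)
```

In a check like this, a `factor` versus `character` or `integer` difference is the kind that matters here; a `numeric` versus `integer` difference would also be flagged but is usually harmless.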

One option is to convert those columns in `newdata` to factors with the same levels as in the training data:

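# Columns present in both the training data and newdata
nm1 <- intersect(names(model$trainingData), names(newdata))
# Of those, the ones stored as factors in the training data
nm2 <- names(which(sapply(model$trainingData[nm1], is.factor)))
# Convert each such column in newdata to a factor with the training levels
newdata[nm2] <- Map(function(x, y) factor(x, levels = levels(y)), newdata[nm2], model$trainingData[nm2])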

Now the prediction works:

predict(model, newdata)
#[1] Sinker
#Levels: Changeup Curveball Fastball Sinker Slider
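
The factor conversion used above can be tried on toy data without the fitted model. This is a sketch in base R only; the data frames and their values are made up and stand in for `model$trainingData` and `newdata`. It also illustrates one caveat: a value that never occurred in training has no matching level, so it becomes `NA` after the conversion, which is worth checking before calling `predict`.

```r
# Toy stand-ins for model$trainingData and newdata (made-up values)
train_toy <- data.frame(
  stand   = factor(c("L", "R", "R")),
  pitcher = factor(c("112526", "115629", "605200")),
  balls   = c(0, 1, 2)
)
new_toy <- data.frame(
  stand   = c("R", "L"),
  pitcher = c(605200L, 999999L),  # 999999 does not occur in train_toy
  balls   = c(3L, 1L),
  stringsAsFactors = FALSE
)

# Shared columns that are stored as factors in the training data
shared      <- intersect(names(train_toy), names(new_toy))
factor_cols <- names(which(sapply(train_toy[shared], is.factor)))

# Rebuild each of those columns as a factor with the training levels
new_toy[factor_cols] <- Map(
  function(x, y) factor(x, levels = levels(y)),
  new_toy[factor_cols], train_toy[factor_cols]
)

str(new_toy)

# Values unseen in training become NA; count them per converted column
unseen <- colSums(is.na(new_toy[factor_cols]))
print(unseen)
```

If any count in `unseen` is nonzero, the corresponding rows contain a category the model has never seen, and a prediction for them may fail or be unreliable.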
akrun