To use random forests and similar models that cap how many categorical levels they can ingest per variable, the usual workaround is to recode the categories as numbers. For example, I have a product list with hundreds of products, so I recode it using as.numeric, which transforms apple to 1, banana to 2, and so on. Because I do this during the build stage, the recode is applied across my training, testing and validation data before I split them into those three buckets, so the model runs perfectly and I can score the test and validation data.
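To make that recode step concrete, here is a tiny illustration with made-up values (not my real product list):

products <- factor(c("apple", "banana", "fish"))
as.numeric(products)   ## apple -> 1, banana -> 2, fish -> 3 (codes follow alphabetical level order)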
My issue is running the model against brand-new data, e.g. last week's new customers. The new data doesn't contain all of the same categorical values, so when I run as.numeric, each categorical value gets a completely new numeric code that doesn't line up with the original codes in the model. On top of that, the new data may introduce new products the model has never seen. For example, we have never sold an orange before, and now one exists, but there are no oranges in the training data.
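The same illustration shows the mismatch: factoring last week's values on their own produces different codes, and an unseen product gets a code that means nothing to the model:

new_products <- factor(c("banana", "fish", "orange"))
as.numeric(new_products)   ## banana -> 1, fish -> 2, orange -> 3; banana was 2 above, and orange never existed in training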
So I run into two problems:
1) "Type of predictors in new data do not match that of the training data" in the case of "orange" existing.
2) The new numeric codes don't represent the same categorical values as in the training data.
I haven't been able to find a solution to this using Google. I've tried packages like "vtreat", but they hit the same issue.
Is there a package/caret function or something that handles this? How is it done?
(My manual option is to create a lookup table for every possible categorical value in every column, which would be horrendously laborious to maintain, so I'm looking for other options; a rough sketch of what I mean is below.)
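This is only a sketch of that manual option, with a hypothetical product_lookup table that I would have to maintain by hand (it is not from any package):

library(dplyr)

## one hand-maintained table per categorical column, fixing a code for every known value
product_lookup <- data.frame(Product = c("Apple", "Banana", "Bread", "Fish", "Milk"),
                             Product_code = 1:5)

## recode new rows by joining against the lookup instead of calling as.numeric
new_rows <- data.frame(Product = c("Apple", "Fish", "Orange"))
left_join(new_rows, product_lookup, by = "Product")   ## unseen "Orange" comes back as NA, but known codes stay consistent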
Thanks in advance.
Here is a reproducible example:
library(dplyr)
library(caret)
library(randomForest)
Trn_ID <- c(1:10)
Cust_ID <- c('1','2','3','4','5','5','4','2','1','5')
Subrub <- c('Malvern','Bentleigh','CBD','Ivanhoe','Altona','Altona','Ivanhoe','Bentleigh','Malvern','Altona')
Product <- c('Fish','Apple','Fish','Banana','Bread','Fish','Milk','Apple','Banana','Bread')
Online <- c('Y','N','N','N','Y','Y','Y','N','Y','Y')
df <- data.frame(Trn_ID,Cust_ID,Subrub,Product,Online, stringsAsFactors = TRUE)  ## keep strings as factors so as.numeric() below returns level codes
## cleanup
df$Subrub <- as.numeric(df$Subrub)    ## factor -> integer level code
df$Product <- as.numeric(df$Product)  ## e.g. Apple -> 1, Banana -> 2, ...
set.seed(150)
inTrain <- createDataPartition(y=df$Online, p=0.75, list=FALSE)
training <- df[inTrain,]
testing <- df[-inTrain,]
training$Online <- factor(training$Online)
testing$Online <- factor(testing$Online)
model <- randomForest(Online ~ ., data=training, ntree=50, mtry=3, importance=TRUE, replace=FALSE)  ## Online ~ . so the response isn't also pulled in as a predictor
print(model)
## new data
Trn_ID <- c(11:14)
Cust_ID <- c('1','2','6','5')
Subrub <- c('Malvern','Bentleigh','Alphington','Altona')
Product <- c('Apple','Fish','Orange','Banana')
Online <- c(NA,NA,NA,NA)
nd <- data.frame(Trn_ID,Cust_ID,Subrub,Product,Online, stringsAsFactors = TRUE)
## cleanup
nd$Subrub <- as.numeric(nd$Subrub)    ## codes are based on the new data's own levels...
nd$Product <- as.numeric(nd$Product)  ## ...so they don't line up with the training codes
## score
sd <- predict(model, newdata=nd, type="prob")  ## scoring the new data is where the problems above appear