I'm trying to run naive Bayes on my data, a large data frame of 35 variables, some of which are factors:
library(e1071)  # assumption: naiveBayes() here is the e1071 implementation
nb1927 <- naiveBayes(ostpayer ~ ., data = trainoversample)
nb199pred <- predict(nb1927, testoversample, type = "class")
I keep getting the error:
Error in `[.default`(object$tables[[v]], , nd + islogical[attribs[v]]) :
subscript out of bounds
Now, I know from searching that mismatched factor levels can cause this. HOWEVER, this same test set already went through logistic regression prediction with no issues after I dropped some levels. So it stands to reason the exact same test set would work for naive Bayes, yes?
I even ran:
sapply(trainoversample, levels)
sapply(testoversample, levels)
on both sets and then put the results through diffchecker.com (great website, btw). It showed that my test set had FEWER levels than the train set (because I'd dropped some for the logistic regression by coercing them into the "UNK" level of those variables).
So it can't be the levels. I even ran the sapply command on the train set after droplevels() and put that through diffchecker as well: still nothing. So it's not the internal level dropping in naiveBayes either.
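The level comparison described above can also be done programmatically instead of through a diff tool. The sketch below is a minimal base-R helper: `compare_levels` is a hypothetical name, and `train_toy` / `test_toy` are made-up stand-ins for `trainoversample` / `testoversample`. For each factor column it reports the levels that occur only in the test set and whether the two level vectors are identical, order included. Order may matter here because the error message shows the model's table being indexed by position (`object$tables[[v]][, nd + ...]`), so a test factor with fewer levels could carry different integer codes than the training factor. That is an assumption about the cause, not a confirmed diagnosis.

```r
# Hypothetical helper (not part of any package): compare factor levels column by column.
compare_levels <- function(train, test) {
  shared <- intersect(names(train), names(test))
  is_fac <- vapply(
    shared,
    function(nm) is.factor(train[[nm]]) || is.factor(test[[nm]]),
    logical(1)
  )
  rows <- lapply(shared[is_fac], function(nm) {
    tr <- levels(train[[nm]])
    te <- levels(test[[nm]])
    data.frame(
      column       = nm,
      n_train      = length(tr),
      n_test       = length(te),
      only_in_test = paste(setdiff(te, tr), collapse = ", "),
      identical    = identical(tr, te),  # same levels in the same order
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data standing in for trainoversample / testoversample
train_toy <- data.frame(
  status = factor(c("BAD", "GOOD", "UNK", "GOOD")),
  amount = c(50, 100, 115, 30)
)
test_toy <- data.frame(
  status = factor(c("GOOD", "UNK")),
  amount = c(20, 40)
)

print(compare_levels(train_toy, test_toy))

# One way to align the sets: rebuild each test factor on the training levels
test_aligned <- test_toy
test_aligned$status <- factor(test_toy$status, levels = levels(train_toy$status))
print(compare_levels(train_toy, test_aligned))
```

Rebuilding the test factors on the training levels keeps the values unchanged and only changes the underlying integer codes, so it should be safe to try even if it turns out not to be the cause.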
Any ideas?
I cannot post the data or the variable names, but here is the str() output for one of the data frames (column names replaced with numbers) in case it helps:
str(testoversample)
'data.frame': 405661 obs. of 35 variables:
$ 1 : int 1207532 1208246 1187313 1259718 1206948 1207319 1206577 1206725 1262913 1209568 ...
$ 2 : num 1668 1208 854 5225 347 ...
$ 3 : Date, format: "2017-04-13" "2017-04-19" "2017-02-13" "2017-11-14" ...
$ 4 : num 50 100 115 1204 30 ...
$ 5 : int 1 1 1 1 1 1 1 1 1 1 ...
$ 6 : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 5 1 1 1 1 5 1 ...
$ 7 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 8 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 9 : Date, format: "2016-02-25" "2016-11-03" "2015-12-29" "2016-11-14" ...
$ 10 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 11 : int 1 1 1 1 1 1 1 1 1 1 ...
$ 12 : num 50 100 115 1204 30 ...
$ 13 : int 284 242 224 313 225 176 318 221 108 244 ...
$ 35 : int 2773 3452 6042 3231 6104 2395 2575 6336 6392 2534 ...
$ 14 : int 1 1 1 1 1 1 1 1 1 1 ...
$ 15 : int 1 6 1 6 3 5 0 13 2 2 ...
$ 16 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 17 : int 0 0 0 0 0 0 0 1 0 0 ...
$ 18 : int 15300 11140 0 9500 8300 1100 16600 500 0 2500 ...
$ 19 : int 13692 1474 0 6916 8981 1543 9687 3 0 1820 ...
$ 20 : int 0 0 0 0 0 0 0 1 0 1 ...
$ 21 : int 0 1 0 0 0 2 0 0 0 1 ...
$ 22 : int 3 1 0 1 3 2 2 0 2 0 ...
$ 23 : int 0 3 0 4 1 0 0 5 1 0 ...
$ 24 : Factor w/ 3 levels "BAD","GOOD","UNK": 2 2 2 2 2 2 2 2 2 2 ...
$ 25 : int 1 1 0 1 1 1 0 1 1 0 ...
$ 26 : Factor w/ 6 levels "CUZ","DFA","DNF",..: 4 4 4 4 4 4 4 4 4 4 ...
$ 27 : Factor w/ 50 levels "AK","AL","AR",..: 18 42 17 48 20 32 5 4 27 5 ...
$ 28 : Factor w/ 6 levels "Discharged","Dismissed",..: 3 3 3 3 3 3 3 1 3 3 ...
$ 29 : Factor w/ 3 levels "Dismissed","Other",..: 2 2 2 2 2 2 2 2 2 2 ...
$ 30 : Factor w/ 6 levels "Discharged","Dismissed",..: 3 3 3 3 3 3 3 3 3 3 ...
$ 31 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 32 : Factor w/ 13 levels "Alternate","AlternateCell",..: 6 6 2 5 5 7 6 6 6 5 ...
$ 33 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 34 : num 0 0 0 0 0 0 0 0 0 0 ...
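The str() output above lists two Date columns ($ 3 and $ 9), and the formula `ostpayer ~ .` uses every column as a predictor. One thing that may be worth ruling out, alongside factor levels, is whether a column that is neither numeric nor a factor is involved: `is.numeric()` returns FALSE for a Date even though it is stored as a day count, so code that treats every non-numeric column as categorical could end up indexing a table with a very large number. The sketch below only shows how to find such columns. `unusual_columns` is a hypothetical helper name, `toy` is made-up data, and whether Date columns actually trigger this error is an assumption that has not been verified here.

```r
# Hypothetical helper: list columns that are neither numeric, factor, nor logical.
unusual_columns <- function(df) {
  ok <- vapply(
    df,
    function(col) is.numeric(col) || is.factor(col) || is.logical(col),
    logical(1)
  )
  vapply(df[!ok], function(col) class(col)[1], character(1))
}

# Made-up data with the same mix of column types as the str() output
toy <- data.frame(
  id     = c(1207532L, 1208246L),
  opened = as.Date(c("2017-04-13", "2017-04-19")),
  grade  = factor(c("GOOD", "UNK"))
)

print(unusual_columns(toy))    # names and classes of the flagged columns

# A Date is not numeric, although its underlying value is a day count
print(is.numeric(toy$opened))
print(unclass(toy$opened))

# One possible workaround: store dates as numeric day counts (in train and test alike)
toy$opened <- as.numeric(toy$opened)
print(unusual_columns(toy))    # nothing flagged after the conversion
```

If this check flags columns in `trainoversample` or `testoversample`, refitting without them, or with them converted to numeric in both sets, would show whether they are related to the error.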