
I have a dataset of 17 columns and 500,000 rows. I want to predict one of these columns for 250,000 of the rows, so my training dataset has 250,000 rows. After splitting it into training and testing sets, I ran "gbm" and "lm" models on the training set:

modellm <- train(DARAMAD ~ ., data = trainig, method = "lm", na.action = na.pass)
modelgbm <- train(DARAMAD ~., data = trainig, method = "gbm", na.action = na.omit)

The problem is that when I predict, I receive a vector of only 9976 elements, while I am trying to predict 250,000 elements.

z <- predict(modelgbm, newdata = forPredict)
z <- predict(modellm, newdata = forPredict)

The forPredict and training datasets both have 250,000 rows.

mjoudy
  • To me it *looks* fine just as it *looks* fine to you. To find the problem we need more than that. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Julius Vainora Mar 17 '18 at 15:54
  • How many rows have missing data in at least one column? Check `sum(complete.cases(forPredict[ , -grep("^DARAMAD$", names(forPredict))]))`. – eipi10 Mar 17 '18 at 16:40

1 Answer


Your code didn't work for me, but I counted the NAs as follows:

naCountFunc <- function(x) sum(is.na(x))
naCount <- sapply(trainData, naCountFunc)
as.data.frame(table(naCount))

  naCount Freq
1       0   12
2       1    1
3     100    2
4  187722    1
5  188664    1

The two columns with many NAs are not the one I want to predict. The "daramad" column has no NAs.
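A per-column count only gives a lower bound on how many rows are affected, because the NAs in different columns can fall in different rows. The sketch below illustrates this on a small made-up data frame; `toy`, `y`, `x1`, and `x2` are hypothetical names for illustration and are not the asker's data.

```r
# Toy data frame standing in for trainData (hypothetical values)
toy <- data.frame(
  y  = c(1, 2, 3, 4, 5, 6),
  x1 = c(NA, NA, 3, 4, 5, 6),
  x2 = c(1, 2, NA, NA, NA, 6)
)

# NAs per column, computed the same way as above
naCount <- sapply(toy, function(x) sum(is.na(x)))
naCount

# Rows with at least one NA among the predictors (everything except y)
predictors <- toy[, names(toy) != "y", drop = FALSE]
incomplete <- sum(!complete.cases(predictors))
incomplete
```

Here the largest per-column count is 3, yet 5 of the 6 rows are incomplete, because the missing values in `x1` and `x2` sit in different rows.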

mjoudy
  • You want to predict `DARAMAD`, but you're including *all* the other columns in your data frame as predictor variables. When fitting the model, only rows for which *all* predictor columns are non-missing will be used for fitting the model. When running `predict` on `forPredict`, only the rows for which *all* predictor columns are non-missing will return predictions. – eipi10 Mar 18 '18 at 17:30
  • Your table shows that, at a minimum, train data has 188,664 rows where at least one column has a missing value. If the column with 187,722 missing values has some missing values in different rows, then many more than 188,664 rows could have at least one missing value. Note that these are the stats for your training data, which affects how many rows of data were used to fit the model. – eipi10 Mar 18 '18 at 17:35
  • Do the same check on `forPredict` to see the minimum number of rows in the test data set that have missing values. (But you should use `complete.cases()` to check how many rows have *at least* one missing value.) – eipi10 Mar 18 '18 at 17:35
  • To check for rows with missing values, here's some reproducible code using the built-in `mtcars` data set. It has 32 rows. Let's assume we want to predict the `mpg` column, so that's the one we'll exclude from the missing predictors check. First, we'll add a missing value: `mtcars[5, 4] = NA`. Then: `sum(complete.cases(mtcars[ , -grep("mpg", names(mtcars))]))`. Or, slightly longer code using `apply`: `sum(apply(mtcars[, -grep("mpg", names(mtcars))], 1, function(x) !any(is.na(x))))`. – eipi10 Mar 18 '18 at 17:40
  • I used `sum(complete.cases(forTrain))` and the answer was 9976, so I think I found a solution for the NAs. Thank you for your answers. – mjoudy Mar 21 '18 at 10:23
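Following the comments above, one way to end up with a prediction vector as long as the input is to predict only on the complete rows and write the results back to their original positions, leaving `NA` elsewhere. This is a minimal sketch using base `lm` on the built-in `mtcars` data rather than the asker's `caret` models. The second missing value and the names `dat`, `ok`, and `pred` are assumptions added for illustration. The same indexing idea should carry over to `forPredict`, but that is untested here.

```r
# Built-in data with missing predictor values introduced on purpose
dat <- mtcars
dat[5, 4] <- NA   # hp missing in row 5 (as in the comment above)
dat[9, 6] <- NA   # wt missing in row 9 (extra NA, added for illustration)

# Flag rows whose predictors (all columns except mpg) are complete
ok <- complete.cases(dat[, names(dat) != "mpg"])
sum(ok)

# lm drops incomplete rows under the default na.action (na.omit)
fit <- lm(mpg ~ ., data = dat)
nobs(fit)

# Predict on complete rows only, then place results at their original positions
pred <- rep(NA_real_, nrow(dat))
pred[ok] <- predict(fit, newdata = dat[ok, ])
length(pred)
```

With two incomplete rows, the model is fitted on 30 of the 32 rows, and `pred` has 32 elements with `NA` in rows 5 and 9. Whether those rows should instead be imputed or predicted from a model with fewer predictors depends on the data.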