Making predictions when model.matrix row count differs from the test data frame in neuralnet

Question

I recently asked a question about the error "requires numeric/complex matrix/vector arguments" when working with the neuralnet library. The original question was: "Working with neuralnet in R for the first time: get "requires numeric/complex matrix/vector arguments" but don't know how to correct".

The solution was to convert the factors in my data frame to "dummy" variables using the model.matrix function. The resulting code was the following:

matrix.train <- model.matrix( 
  ~ survived + pclass + sex + age + sibsp + parch + fare + embarked, 
  data = train
)

Because my source data frame has individual NA values scattered throughout, the resulting matrix ends up with 714 rows rather than the 891 rows of the original data frame.

This is OK for my training data. However, when I load my test data frame and convert it to a matrix, I run into the same issue. This time I get 331 matrix rows vs the 418 rows in my source data frame.

After I run compute to apply the model to my test data, I'm unable to cbind my predictions back to my test data because the row counts are different. So, my question is:

Is there a way to force model.matrix to output the same number of rows as the source data frame, keeping the rows that contain NA values instead of dropping them? My model will need to handle NA and still output a prediction, because rows with at least one NA are common. Alternatively, would it be better to tell neuralnet to treat NA values as valid factor levels?

Here is the code I've been attempting to use so far:

#Build a matrix from training data (714 rows vs 891 rows due to NAs in data) 
matrix.train <- model.matrix(
  ~ survived + pclass + sex + age + sibsp + parch + fare + embarked, 
  data=train
)

library(neuralnet)

#Train the neural net
net <- neuralnet(
  survived ~ pclass + sexmale + age + sibsp + parch + fare + embarkedC + 
  embarkedQ + embarkedS, data=matrix.train, hidden=10, threshold=0.01
)

#Build a matrix from test data (331 rows vs 418 rows due to NAs in data)
matrix.test <- model.matrix(~ pclass + sex + age + sibsp + parch + fare + embarked, 
  data=test
)

#Apply neural net to test matrix 
net.results <- compute(
  net, matrix.test
)

#Attempt to map results back to original test data
cleanoutput <- cbind(
  net.results$net.result,test
)

Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 331, 418

When I try to use the row names from the train data frame to force matrix.train to the same row count, I get the following:

matrix.train <- matrix.train[match(rownames(train),rownames(matrix.train)),]

> matrix.train
    (Intercept) survived pclass sexmale   age sibsp parch     fare embarkedC embarkedQ embarkedS
1             1        0      3       1 22.00     1     0   7.2500         0         0         1
2             1        1      1       0 38.00     1     0  71.2833         1         0         0
3             1        1      3       0 26.00     0     0   7.9250         0         0         1
4             1        1      1       0 35.00     1     0  53.1000         0         0         1
5             1        0      3       1 35.00     0     0   8.0500         0         0         1
6            NA       NA     NA      NA    NA    NA    NA       NA        NA        NA        NA
7             1        0      1       1 54.00     0     0  51.8625         0         0         1

However, that row of NAs is inaccurate. The source row may contain only one NA value, but whenever a row contains any NA, the whole matrix row comes back as NA. Instead of the above, this is what I would like to see:

> matrix.train
    (Intercept) survived pclass sexmale   age sibsp parch     fare embarkedC embarkedQ embarkedS
1             1        0      3       1 22.00     1     0   7.2500         0         0         1
2             1        1      1       0 38.00     1     0  71.2833         1         0         0
3             1        1      3       0 26.00     0     0   7.9250         0         0         1
4             1        1      1       0 35.00     1     0  53.1000         0         0         1
5             1        0      3       1 35.00     0     0   8.0500         0         0         1
6             1        0      3       1    NA     0     0   6.2500         1         0        NA
7             1        0      1       1 54.00     0     0  51.8625         0         0         1
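
Editor's sketch: one approach that may get close to the layout above is to build the model frame with `na.action = na.pass` and hand that frame to `model.matrix`, so incomplete rows are kept and only the affected cells are NA. The data frame `toy` and the names `mm.dropped`, `mf` and `mm.full` are illustrative assumptions, not part of the original question. Note that when a factor value is missing, every dummy column for that factor comes back NA for that row. The NA cells would most likely still need handling (for example, imputation) before the matrix is passed to neuralnet.

```r
# Toy stand-in for the Titanic data; values are illustrative only.
toy <- data.frame(
  survived = c(0, 1, 1, 0),
  age      = c(22, NA, 26, 35),
  sex      = factor(c("male", "female", "female", "male")),
  embarked = factor(c("S", "C", NA, "S"))
)

# Default behaviour: any row containing an NA is dropped.
mm.dropped <- model.matrix(~ survived + age + sex + embarked, data = toy)

# Build the model frame with na.pass so every row is kept,
# then expand the factors into dummy columns from that frame.
mf <- model.frame(~ survived + age + sex + embarked, data = toy,
                  na.action = na.pass)
mm.full <- model.matrix(~ survived + age + sex + embarked, data = mf)

nrow(mm.dropped)  # fewer rows than toy
nrow(mm.full)     # same number of rows as toy
```

Because `mf` already carries its own terms, `model.matrix` uses it as is and does not apply the default `na.omit` a second time.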

Wouldn't it be simpler to just drop the NA rows in the original data frame to begin with? Or maybe you can match them back up using the row names? — joran, Jul 05 '13 at 15:55
It would definitely be easier to drop the NA rows but, unfortunately, that wouldn't meet my needs. Eventually I need to offer predictions even for rows that have NA values. To clarify, NA row means that one or two columns within the row contain an NA value NOT that the whole row is blank. — user2548029, Jul 05 '13 at 16:07
Well, then, simply use the row names as an index, or create your own dummy column to use as an index. — joran, Jul 05 '13 at 16:10
Could you show me how to do that? I'm still really new to R and haven't ever done that before. (P.S., love your website!) — user2548029, Jul 05 '13 at 16:16
Updated my question with results of using `matrix.train <- matrix.train[match(rownames(train),rownames(matrix.train)),]` — user2548029, Jul 05 '13 at 17:24
Why don't you change your NA values to an implausible number, such as 999 or 0 for age, so that when your analysis is done you can exclude the age data without losing other info? — Abdocia, Jul 05 '13 at 18:05
That's definitely a possibility. Alternately, is there a way to tell the model to treat blanks or NAs as valid data? That way I could keep the underlying data untouched. The risk, I guess, would be that the neural net might learn to treat '999' or 'NA' entries as if they had significance rather than just ignoring them. I'd prefer that the neural net ignore individual NA values without ignoring the entire row altogether. — user2548029, Jul 05 '13 at 18:15
I think this link is really helpful for what you are looking for: http://www.ats.ucla.edu/stat/r/faq/missing.htm — Abdocia, Jul 05 '13 at 19:54
You can try `na.pass(matrix.train)`, and if that doesn't work you probably need to apply it in the previous command as well: `matrix.train<-matrix.train[match(rownames(na.pass(train)),rownames(na.pass(matrix.train))),]` — Abdocia, Jul 05 '13 at 20:10
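
Editor's sketch of the row-name index suggested in the comments: `model.matrix` drops incomplete rows but keeps the original row names, so predictions can be written back into a full-length column and the dropped rows simply stay NA. The small `test` data frame and the `pred` vector are hypothetical stand-ins. In the real workflow `pred` would presumably come from `compute(net, ...)$net.result`, which is assumed to return one row per row of the input matrix.

```r
# Hypothetical test set; rows 2 and 4 each contain one NA.
test <- data.frame(
  pclass = c(3, 1, 2, 3, 1),
  age    = c(34.5, NA, 62, 27, 22),
  fare   = c(7.83, 7.00, 9.69, NA, 12.29)
)

# Incomplete rows are dropped, but surviving rows keep their row names.
matrix.test <- model.matrix(~ pclass + age + fare, data = test)

# Stand-in for the network output: one score per surviving matrix row.
pred <- rowSums(matrix.test[, c("pclass", "age", "fare")])

# Locate each surviving row in the original data frame by row name,
# then fill a prediction column that starts out as all NA.
idx <- match(rownames(matrix.test), rownames(test))
test$prediction <- NA_real_
test$prediction[idx] <- pred
```

This keeps all rows of `test` in their original order. Rows that could not be scored carry NA in `prediction`, so producing a prediction for them as well would still require filling in the missing inputs first.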
