One of the variables, 'Cabin', has a large number of NAs. I am trying to use a decision tree (rpart) to predict the cabin deck of passengers whose Cabin is missing.

This is the current structure of my data table, which is an rbind of the training and test sets.

 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 187 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
 $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
 $ FamilyID   : Factor w/ 8 levels "11","3","4","5",..: 8 8 8 8 8 8 8 4 2 8 ...
 $ FamilyID2  : Factor w/ 7 levels "11","4","5","6",..: 7 7 7 7 7 7 7 3 7 7 ...
 $ Title      : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
 $ Surname    : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
 $ Cabin2     : Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...

Please note that I used strsplit to create 'Cabin2', which holds the leading letter of the 'Cabin' variable; to my understanding, that letter corresponds to the deck on the Titanic. This reduced the number of levels I was fighting with from 187 in 'Cabin' to 8 in 'Cabin2'.
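For readers who want to reproduce the 'Cabin2' step, here is a minimal sketch of one way to pull out the deck letter with strsplit. The `cabin` vector is made-up sample data and the variable names are illustrative, not my exact code; it assumes the cabin values are character (a factor would need `as.character()` first).

```r
# Hypothetical sample of cabin strings, including a missing and an empty value.
cabin <- c("C85", NA, "E46", "", "B57 B59 B63 B66")

# Split each string into single characters and keep the first one.
# NA stays NA, and an empty string yields character(0), whose first
# element is also NA.
deck <- vapply(strsplit(cabin, ""), function(x) x[1], character(1))

# Store the deck letters as a factor; NA is not counted as a level.
deck <- factor(deck)
levels(deck)
```

Taking only the first character means a passenger with several cabins listed is assigned the deck of the first one.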

I am trying to use the following code to predict the cabin deck:

cabinFit <- rpart(Cabin2 ~ Age + Sex + Fare + Embarked + SibSp + Parch + Title + FamilySize + FamilyID,

combi$Cabin2[is.na(combi$Cabin2)] <- predict(cabinFit,     combi[is.na(combi$Cabin2),])

R gives me the following output:

 Warning messages:
 1: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L,   :
  invalid factor level, NA generated
 2: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L,   :
  number of items to replace is not a multiple of replacement length

I have been trying to make sense of this while I keep working with these data, but I cannot see why this bit of code doesn't do the trick.

  • Try `predict(cabinFit, combi[is.na(combi$Cabin2),], type="class")`. Otherwise be sure to include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that includes sample data we can copy/paste into R to run your code. It doesn't have to be the actual data you are using, just something similar to reproduce the error. – MrFlick Jun 11 '15 at 03:10
  • Hey Dylan. I also participated in this challenge. I noticed that those who have cabin as NA usually perish (you could check the fraction with survived = 0 among all people without a cabin). I assume that cabin = NA means exactly that: the passenger had no personal cabin and was located in the ship's hold. Try to use NA as a separate type of cabin; it should be fine. – Maksim Khaitovich Jun 11 '15 at 19:03
  • Thanks all! MrFlick's subtle yet powerful recommendation did the trick! Thanks! – Dylan K Jun 11 '15 at 22:34
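For anyone landing here with the same warnings, the following is a minimal sketch of why `type="class"` matters. It uses the built-in `iris` data as a stand-in for `combi` (an assumption for illustration; the names `toy`, `known`, `probs` and `pred` are made up). When `type` is not given, `predict()` on an rpart classification tree returns a matrix of class probabilities rather than a vector of labels, which is the likely source of both the "invalid factor level" and the "replacement length" warnings.

```r
library(rpart)

# Toy stand-in for `combi`: iris with 30 Species values blanked out.
set.seed(1)
toy <- iris
toy$Species[sample(nrow(toy), 30)] <- NA
known <- !is.na(toy$Species)

# Fit a classification tree on the rows where the response is known.
fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
             data = toy[known, ], method = "class")

# Without `type`, the result is a probability matrix:
# one row per observation, one column per class level.
probs <- predict(fit, toy[!known, ])
dim(probs)

# With type = "class", the result is a factor of predicted labels,
# one per observation, which can be assigned back into the column.
pred <- predict(fit, toy[!known, ], type = "class")
toy$Species[!known] <- as.character(pred)
```

Assigning the probability matrix into the factor column is what fails: it has one value per class for every row, and none of those numbers is a valid level.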
