I'm running random forest predictions in R. This is my data set, Aggregate_sample.csv:

Company Index,Is Customer,videos,videos watched,tutorials,workshops,casestudies,productpages,other,totalActivities,youngTitleCount,oldTitleCount,Median between two Activities,Max between 2 activities,Time since100thactivity
STPT,0,0,3,0,0,0,0,19,22,0,22,120,64074480,0
STPR,0,0,1,0,1,1,0,61,64,0,64,120,56004420,0
PLNRTKNLJS,0,0,0,0,0,0,0,25,25,25,0,810,4349940,0
ASSSNNSP,0,0,0,0,0,3,0,17,20,0,20,60,2220,0
STPP,1,164,32,25,36,26,0,2525,2808,498,2310,60,2938260,76789992
AJKMPTNKSL,0,0,0,0,0,0,0,1,1,0,1,0,0,0
FNKL,0,0,0,1,0,0,0,21,22,0,22,300,2415900,0
FNKK,0,0,1,0,0,0,0,1,2,2,0,60,60,0
FNKN,1,2,0,1,0,0,0,22,25,0,25,480,150840,0

The following is my R script:

 # Install and load required packages for decision trees and forests
library(rpart)
install.packages('randomForest')
library(randomForest)

Aggregate <- read.csv("~/Documents/Machine Lerning/preprocessed data/Aggregate_sample.csv")

# splitdf function
splitdf <- function(dataframe, seed=NULL) {
if (!is.null(seed)) set.seed(seed)
index <- 1:nrow(dataframe)
trainindex <- sample(index, trunc(length(index)*0.7))
trainset <- dataframe[trainindex, ]
testset <- dataframe[-trainindex, ]
list(trainset=trainset,testset=testset)
}

splits <- splitdf(Aggregate, seed=808)

#it returns a list - two data frames called trainset and testset
str(splits)

lapply(splits,nrow)

#view the first few columns in each data frame
lapply(splits,head)

training <- splits$trainset
testing  <- splits$testset

#fit the randomforest model
model <- randomForest(as.factor(Aggregate$Is.Customer) ~ Aggregate$seniorTitleCount +
Aggregate$juniorTitleCount + Aggregate$totalActivities + Aggregate$Max.between.2.activities
+ Aggregate$Time.since.100th.activity + Aggregate$downloads , data=training,
importance=TRUE, ntree=2000)

#print(mode)
# what are the important variables
varImpPlot(model)

But I keep getting the following error and cannot proceed. It seems there's something wrong with my IsCustomer column, but it's just a column of "0"s and "1"s (I don't have any NAs in my dataset).

Error in [<-.factor("*tmp*", keep, value = c("0", "0", "0", "0", "1", :
  NAs are not allowed in subscripted assignments
In addition: Warning message:
'newdata' had 3 rows but variables found have 9 rows

I read the following question, which seems related to mine, but couldn't find an answer in it: Assigning within a lapply "NAs are not allowed in subscripted assignments"

Thanks in advance.

plr
  • The fact that you have `Aggregate$` in the formula and are also specifying a `data=` parameter seems suspicious. But if you really want help, you'll need to provide a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) which includes sample input data so we can tell you exactly why that's happening. – MrFlick Dec 11 '14 at 06:03
  • Maybe there are NAs in the data. Try the `na.roughfix` function in the randomForest package itself. – Koundy Dec 11 '14 at 06:19
  • Thanks for replying. @MrFlick, I updated the question with more information. Koundy, there are no NAs in my csv either. – plr Dec 11 '14 at 07:02

1 Answer

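It looks like you want to draw your data only from the training data.frame, so you should not be referencing Aggregate in your formula. Terms written as Aggregate$... are looked up in the full Aggregate data frame, so the data= argument is effectively ignored. Using the variable names that are actually in your sample data, this seems to work just fine: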

randomForest(as.factor(Is.Customer) ~ oldTitleCount +
   youngTitleCount + totalActivities + Max.between.2.activities +
    Time.since100thactivity + videos , 
data=training,
importance=TRUE, ntree=2000)

which returns

Call:
 randomForest(formula = as.factor(Is.Customer) ~ oldTitleCount +      youngTitleCount + totalActivities + Max.between.2.activities +      Time.since100thactivity + videos, data = training, importance = TRUE,      ntree = 2000) 
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 16.67%
Confusion matrix:
  0 1 class.error
0 4 0         0.0
1 1 1         0.5
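
To see why the `Aggregate$` prefix matters, here is a minimal sketch in base R. It assumes a hypothetical 9-row toy data frame called `full` (standing in for `Aggregate`) and swaps in `lm` for `randomForest`, only because the formula lookup in `model.frame` behaves the same way and needs no extra package. The names `full`, `train`, `test`, `x` and `y` are illustrative and not from the question.

```r
# Hypothetical toy data standing in for Aggregate (9 rows, like the sample).
full <- data.frame(y = c(0, 0, 0, 0, 1, 0, 0, 0, 1),
                   x = 1:9)
train <- full[1:6, ]   # 6 training rows
test  <- full[7:9, ]   # 3 test rows

# Terms written as full$... are evaluated against `full` itself,
# so `data = train` has no effect on them.
mf_bad <- model.frame(as.factor(full$y) ~ full$x, data = train)
nrow(mf_bad)    # 9: all rows of `full`, not just the training rows

# Bare column names are resolved inside `data`.
mf_good <- model.frame(as.factor(y) ~ x, data = train)
nrow(mf_good)   # 6

# The same lookup happens at prediction time (lm used here as a stand-in).
fit_bad <- lm(full$y ~ full$x, data = train)
# Without suppressWarnings() this warns:
# 'newdata' had 3 rows but variables found have 9 rows
pred_bad <- suppressWarnings(predict(fit_bad, newdata = test))
length(pred_bad)    # 9: `newdata` was ignored

fit_good <- lm(y ~ x, data = train)
pred_good <- predict(fit_good, newdata = test)
length(pred_good)   # 3: one prediction per test row
```

With bare column names, the model is fitted on `training` only and `predict(model, newdata = testing)` returns one prediction per test row, which is presumably what the original script was aiming for.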
MrFlick
  • Thanks a lot. It works! I just started with R the day before yesterday. I feel pretty dumb now :) . Thanks again. – plr Dec 11 '14 at 08:09