I am trying to fit a random forest classifier to my dataset. I am very new to R, and I imagine this is a simple formatting issue.

I read in a text file and transform my dataset so it is of this format: (taking out confidential info)

>head(df.train,2)

   GOLGA8A     ITPR3   GPR174  SNORA63    GIMAP8     LEF1    PDE4B LOC100507043    TGFB1I1    SPINT1
Sample1  3.726046 3.4013711 3.794364 4.265287 -1.514573 7.725775 2.162616    -1.514573 -1.5145732 -1.514573
Sample2 4.262779 0.9261892 4.744096 7.276971 -1.514573 4.694769 4.707387     2.031476 -0.8325444  2.615991
...
...
CD8B     FECH    PYCR1 MGC12916     KCNA3 resp
Sample1  -1.514573 2.099336 3.427928 1.542951 -1.514573    1
Sample2 -1.145806 1.204241 2.846832 1.523808  1.616791    1

In essence, the columns are my features and the rows are my samples; the last column, resp, is my response vector, a column of factors.

Then I use:

library(randomForest)
set.seed(1)  # set the seed for reproducibility

RF1 <- randomForest(resp ~ ., data = df.train, ntree = 1000, importance = TRUE, mtry = 3)

Simply trying to train the RF on my column resp using the other columns as features.

But I obtain the error:

Error in eval(expr, envir, enclos) : object 'PCNA-AS1' not found

However, looking into my training set I can clearly find that column, e.g with:

sort(unique(colnames(df.train)))

So I don't really understand the error or where to go from here. My apologies if I haven't posed the question in the correct way, thanks for any and all help!

tumultous_rooster
AHawks
  • Could you make this a reproducible example (aka provide sample data for `df.train` that causes the error)? – josliber Jan 29 '16 at 01:04

2 Answers

23

I would suspect this comes from having an illegal variable name in your data frame. Let's consider a data frame that just has a response variable resp and a variable (illegally) named PCNA-AS1:

(dat <- structure(list(`PCNA-AS1` = c(1, 2, 3), resp = structure(c(2L, 2L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("PCNA-AS1", "resp"), row.names = c(NA, -3L), class = "data.frame"))
#   PCNA-AS1 resp
# 1        1    1
# 2        2    1
# 3        3    0

Now when we train a random forest we get the indicated error:

library(randomForest)
mod <- randomForest(resp~., data=dat)
# Error in eval(expr, envir, enclos) : object 'PCNA-AS1' not found

A natural solution to this problem would be converting your variable names to all be legal:

names(dat) <- make.names(names(dat))
dat
#   PCNA.AS1 resp
# 1        1    1
# 2        2    1
# 3        3    0
mod <- randomForest(resp~., data=dat)

Now the model trains with no error.
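
Before renaming anything, it can help to see which columns would actually be affected. The following is a minimal base-R sketch, not part of the original answer: the data frame `dat_check` and its values are made up for illustration, and it flags any column whose name `make.names()` would change.

```r
# Hypothetical data frame with one non-syntactic column name;
# check.names = FALSE keeps "PCNA-AS1" exactly as written
dat_check <- data.frame(`PCNA-AS1` = c(1, 2, 3),
                        LEF1 = c(4, 5, 6),
                        resp = factor(c(1, 1, 0)),
                        check.names = FALSE)

# A name that make.names() would alter is not syntactically valid
bad <- names(dat_check)[names(dat_check) != make.names(names(dat_check))]
bad

# unique = TRUE guards against two cleaned names colliding
names(dat_check) <- make.names(names(dat_check), unique = TRUE)
names(dat_check)
```

With gene-style names, hyphens are the usual culprit; listing the offending names first makes it easier to map the renamed columns back to the originals afterwards.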

josliber
  • Thanks for your comment, josliber. I tried converting to legal names, but that wasn't the problem. The actual error was that I gave randomForest a matrix rather than a data frame; I had assumed it didn't matter and that randomForest could easily convert between the two. I was mistaken, and I have solved the issue now. – AHawks Jan 29 '16 at 03:20
  • @AHawks OK, then all the more reason for you to edit your question to make it reproducible! (aka including the code and data needed to replicate the issue). Try cutting down the columns in your data frame to the smallest number where you can reproduce the issue, and then post that dataset (if you haven't already figured out what's going on first). – josliber Jan 29 '16 at 03:21
  • Yes, you are definitely correct, that would have been better, and I will do that for future questions. I am just getting used to presenting problems here on Stack Overflow, so thanks for your advice! – AHawks Jan 31 '16 at 22:02
0

In short, it was a rookie mistake: I was passing a matrix rather than a data.frame, which caused this error. I still don't understand why it complained about that particular column (which was not the first) rather than another. Thanks for all the help. Cheers, Anthony
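
For anyone hitting the same thing, here is a rough base-R sketch of the matrix-to-data-frame step. The matrix `m` and its values are hypothetical, not the original data. It shows that `as.data.frame()` keeps the matrix column names untouched, while `data.frame()` (with its default `check.names = TRUE`) converts them to syntactic names.

```r
# Hypothetical numeric matrix with a non-syntactic column name
m <- matrix(c(3.7, 4.3, 7.7, 4.7, 1, 0), nrow = 2,
            dimnames = list(c("Sample1", "Sample2"),
                            c("PCNA-AS1", "LEF1", "resp")))

# as.data.frame() keeps the matrix column names as they are
df_asis <- as.data.frame(m)
names(df_asis)

# data.frame() passes the names through make.names() by default
df_checked <- data.frame(m)
names(df_checked)

# A matrix cannot hold a factor, so the response has to be
# converted explicitly before fitting a classifier
df_checked$resp <- factor(df_checked$resp)
```

Note that a numeric matrix stores the response as plain numbers, so the explicit `factor()` conversion matters if classification rather than regression is intended.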

AHawks
  • When creating a data.frame with data.frame(), check.names defaults to TRUE, so passing a data.frame could have fixed the problem, as illegal column names would have been converted. In general, randomForest gives far fewer problems with a data.frame than with a matrix. – Soren Havelund Welling Jan 29 '16 at 12:29
  • This is not an answer but rather a comment to the real answer, which should be marked as accepted. – Cath Jul 02 '18 at 07:49