
I have a rather basic model that tries to predict the next day's volume for one stock. However, I'd like to predict the volume for all three stocks, so instead of one outcome there are three.

outcomeSymbol <- cbind('AAPL.Volume','ADBE.Volume','ADI.Volume')

Here's what the head of the outcomes looks like (dates in random order):

[image: head of the three outcome columns]

Here is the training code, which works fine with one outcome variable (outcomeSymbol <- 'AAPL.Volume'):

bst <- train(train[,predictorNames],  as.factor(train$outcome),
             method='gbm'
)

But when I run this with the three outcome variables, I get:
Error: nrow(x) == n is not TRUE

Do I have to use different parameters or a different model if there is more than one outcome?

The entire code, so you can see everything and run it yourself: https://gist.github.com/alteredorange/b97481ed7e00b33bab0d28dcdd7d0e4a

Alteredorange
  • This sounds like a question about data modeling which is off-topic for this site. Such questions belong on [stats.se]. If it really is about programming, include a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in the question itself. – MrFlick Mar 13 '17 at 15:57
  • @MrFlick The code is kind of long, which is why I included the gist link. Should I include the whole code in the question? – Alteredorange Mar 13 '17 at 17:15
  • No. You should recreate only what's necessary to make your problem clear. We are here to answer a specific question, not to go through your entire analysis. But that's only if this is actually a programming question which it doesn't seem like to me. – MrFlick Mar 13 '17 at 17:20
  • @MrFlick the specific question is how to model three outcomes in r. The error I get is provided (Error: nrow(x) == n is not TRUE). I'll add the train portion to the question as well. – Alteredorange Mar 13 '17 at 17:52
  • It seems like your question is more along the lines of *"what statistical method (with an R implementation) can I use to simultaneously model 3 dependent variables"* - which is not a specific programming problem. You seem to assume that `gbm` works with multiple dependent variables, but I see nothing in the documentation to suggest that is true. I think you should try either the Data Science or Statistics stack exchange sites. – Gregor Thomas Mar 13 '17 at 18:20
  • @Gregor True! I'll give those two places a shot. I've asked in other modeling forums more abstractly and they say it's a programming problem. So I try to make it more specific and ask here and it's called a modelling problem :) I'll keep banging away at it, thanks! – Alteredorange Mar 13 '17 at 20:38

1 Answer

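train() expects a single outcome vector with one value per row of the predictors, so fit a separate model for each outcome column. You need to change the code in the following way (from line #63 to line #78):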

set.seed(1234)
split <- sample(nrow(nasdaq100), floor(0.7*nrow(nasdaq100)))

# process the outcome variables for the entire data 
nasdaq100$outcome <- ifelse(nasdaq100$outcome==1,'yes','nope')
nasdaq100$outcome <- sapply(as.data.frame(nasdaq100$outcome), function(x) as.factor(x)) 

train <- nasdaq100[split,]
test <- nasdaq100[-split,]

# learn 3 different models, one for each outcome variable
bst <- lapply(1:3, function(i) train(train[,predictorNames],train$outcome[,i],method='gbm'))

# compute the AUC separately for each of the 3 models
library(pROC)
# store the results under a name that does not mask pROC::auc()
auc_scores <- lapply(1:3, function(i) {
  predictions <- predict(object=bst[[i]], test[,predictorNames], type='prob')
  auc(test$outcome[,i],predictions[,2])
})

# auc scores for 3 models
print(paste('AUC score:', auc_scores)) 
# [1] "AUC score: 0.662664263875109" "AUC score: 0.698058147615867" "AUC score: 0.719709083058406"
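
For context on the original error, here is a minimal base-R sketch of what likely goes wrong. It uses toy data with hypothetical names (x, y, flat), not the nasdaq100 data from the gist, and it assumes the failure comes from as.factor() flattening the three-column outcome, so the outcome ends up with three times as many elements as there are predictor rows:

```r
# Toy data (hypothetical): 6 rows, 2 predictors, and a 3-column outcome matrix.
set.seed(1)
x <- data.frame(p1 = rnorm(6), p2 = rnorm(6))
y <- cbind(A = c(1, 0, 1, 0, 1, 0),
           B = c(0, 0, 1, 1, 0, 1),
           C = c(1, 1, 0, 0, 1, 0))

# as.factor() drops the matrix dimensions, so the result has 18 elements, not 6.
flat <- as.factor(y)
length(flat)             # 18
nrow(x) == length(flat)  # FALSE: the same kind of check that fails in the error

# Taking one column at a time restores one outcome value per predictor row.
per_column <- lapply(seq_len(ncol(y)), function(i) as.factor(y[, i]))
sapply(per_column, length)  # 6 6 6
```

This is why the code above indexes train$outcome[,i] inside lapply(): each call then receives an outcome of the same length as the number of predictor rows.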
Sandipan Dey