
In a fairly balanced binary classification problem, I am observing an unusually high error rate for class 0 in an h2o.gbm model, on the training set itself. The data is from a competition that has already ended, so my interest is purely in understanding what is going wrong.

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
            0      1    Error            Rate
0      147857 234035 0.612830  =234035/381892
1       44782 271661 0.141517   =44782/316443
Totals 192639 505696 0.399260  =278817/698335

Any expert suggestions on how to treat the data and reduce this error are welcome. The following approaches have been tried, and neither reduced the error:

- Approach 1: selecting the top 5 important variables via h2o.varimp(gbm)
- Approach 2: converting negative normalized variable values to 0 and positive values to 1
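For reference, the two approaches above look roughly like this (a sketch only; `gbm` stands for an already-trained model and `train` for the training data frame, and the column names are taken from the data definition below):

```r
# Approach 1: keep only the top 5 variables by GBM importance
vi <- h2o.varimp(gbm)        # importance table, one row per variable
top5 <- vi$variable[1:5]     # names of the 5 most important features

# Approach 2: binarize the normalized features by sign
# (negative -> 0, positive -> 1)
num_cols <- c("Volume", "Three_Day_Moving_Average", "Five_Day_Moving_Average")
train[num_cols] <- lapply(train[num_cols], function(x) as.integer(x > 0))
```

Neither transformation changed the class-0 error in my runs.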

    #Data Definition

# Variable                        Definition

#Independent Variables

# ID                              Unique ID for each observation
# Timestamp                       Unique value representing one day
# Stock_ID                        Unique ID representing one stock
# Volume                          Normalized volume traded for the given stock ID on that timestamp
# Three_Day_Moving_Average        Normalized three-day moving average of closing price for the given stock ID (including current day)
# Five_Day_Moving_Average         Normalized five-day moving average of closing price for the given stock ID (including current day)
# Ten_Day_Moving_Average          Normalized ten-day moving average of closing price for the given stock ID (including current day)
# Twenty_Day_Moving_Average       Normalized twenty-day moving average of closing price for the given stock ID (including current day)
# True_Range                      Normalized true range for the given stock ID
# Average_True_Range              Normalized average true range for the given stock ID
# Positive_Directional_Movement   Normalized positive directional movement for the given stock ID
# Negative_Directional_Movement   Normalized negative directional movement for the given stock ID

#Dependent Response Variable
# Outcome                         Binary variable: whether the stock's price at tomorrow's market close is higher (1) or lower (0) than at today's market close


temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/test_6lvBXoI.zip',temp)
test <- read.csv(unz(temp, "test.csv"))
unlink(temp)


temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/train_xup5Mf8.zip',temp)
#Please wait for 60 Mb file to load.
train <- read.csv(unz(temp, "train.csv"))
unlink(temp)

summary(train)

#We don't want the ID
train<-train[,2:ncol(train)]
# Preserving Test ID if needed
ID<-test$ID
#Remove ID from test
test<-test[,2:ncol(test)]
#Create empty response column Outcome in test
test$Outcome<-NA
#Original
combi.imp<-rbind(train,test)

rm(train,test)
summary(combi.imp)

#Creating Factor Variable
combi.imp$Outcome<-as.factor(combi.imp$Outcome)
combi.imp$Stock_ID<-as.factor(combi.imp$Stock_ID)
combi.imp$timestamp<-as.factor(combi.imp$timestamp)

summary(combi.imp)


#Brute-force NA treatment: split combi.imp back into train/test, keeping only complete training cases.
train.complete<-combi.imp[1:702739,]
train.complete<-train.complete[complete.cases(train.complete),]
test.complete<-combi.imp[702740:804685,]

library(h2o)
y<-c("Outcome")
features <- names(train.complete)[!names(train.complete) %in% c("Outcome")]
#Shut down any previously running H2O instance before starting a fresh one.
h2o.shutdown(prompt = FALSE)
#Adjust memory size based on your system.
h2o.init(nthreads = -1, max_mem_size = "5g")

train.hex<-as.h2o(train.complete)
test.hex<-as.h2o(test.complete[,features])

#Models
gbmF_model_1 = h2o.gbm( x=features,
                        y = y,
                        training_frame =train.hex,
                        seed=1234
)
h2o.performance(gbmF_model_1)
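To see where the class-0 error comes from, the fitted model can be interrogated with the standard h2o accessors (a sketch, using `gbmF_model_1` as trained above):

```r
perf <- h2o.performance(gbmF_model_1)
h2o.confusionMatrix(perf)   # confusion matrix at the F1-optimal threshold
h2o.auc(perf)               # overall AUC on the training frame
```

The confusion matrix shown at the top of the question was obtained this way.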
Hari Prasad
  • There is not enough information here for me to respond with anything useful, since you are asking for general data science advice (without providing information about the dataset) and not for help with coding or software. You need a reproducible example, you need to explain why you think the GBM is not performing well. What do you expect the performance to be and why? – Erin LeDell Apr 03 '17 at 23:29
  • Thanks Erin. 1. Reproducible example: the code I have posted is reproducible from any R session with the h2o package, as the data is read from a URL. You can run this code as-is and get the result. 2. What we are seeing on the training data is huge misclassification of the binary class 0, over 60%. I would expect this in imbalanced response data where few responses are of class 0 and the rest are class 1, but here the responses are almost 50/50 balanced. The question is how to reduce misclassification of class 0. – Hari Prasad Apr 04 '17 at 12:21
  • Erin, at the beginning of the code I have explained each and every column of the data. That is the only information I have on the dataset. – Hari Prasad Apr 04 '17 at 12:25
  • Hari, when I first commented, I only saw your data definitions and missed the fact that you were actually importing the data. I still think this is more of a general data science/modeling question rather than a software question (there are no bugs or errors in the code itself), so I can't be of much help, sorry. – Erin LeDell Apr 04 '17 at 18:19

1 Answer

You've only trained a single GBM with the default parameters, so it doesn't look like you've put enough effort into tuning your model. I'd recommend a random grid search on GBM using the h2o.grid() function. Here is an H2O R code example you can follow.
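A random grid search along those lines might look like this (a sketch with illustrative hyperparameter ranges, not tuned values; `train.hex`, `features`, and `y` are as defined in the question):

```r
# Hyperparameter space to sample from
hyper_params <- list(
  max_depth       = c(3, 5, 7, 9),
  learn_rate      = c(0.01, 0.05, 0.1),
  sample_rate     = c(0.7, 0.8, 1.0),
  col_sample_rate = c(0.7, 0.8, 1.0)
)

# Stop after 20 models or 10 minutes, whichever comes first
search_criteria <- list(strategy = "RandomDiscrete",
                        max_models = 20,
                        max_runtime_secs = 600)

gbm_grid <- h2o.grid("gbm",
                     grid_id = "gbm_random_grid",
                     x = features, y = y,
                     training_frame = train.hex,
                     ntrees = 500,
                     seed = 1234,
                     hyper_params = hyper_params,
                     search_criteria = search_criteria)

# Rank the trained models by AUC and retrieve the best one
sorted_grid <- h2o.getGrid("gbm_random_grid", sort_by = "auc", decreasing = TRUE)
best_gbm <- h2o.getModel(sorted_grid@model_ids[[1]])
```

Dropping `search_criteria` (or setting `strategy = "Cartesian"`) turns this into an exhaustive Cartesian search over the same grid.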

Erin LeDell
  • Thanks Erin. I will surely do that, and I agree that my question sounds generic. Moreover, I strongly believe that for stock exchange problems we invariably need to do feature engineering by deriving the typical traders' indicators, like the average true range and others. Thanks for spending some time on it. I am keeping this question open for a few days to see if we can get any other insights, then I will close it. Hope that is in the spirit of Stack Overflow. Thanks again. – Hari Prasad Apr 04 '17 at 19:51
  • It turned out to be a great decision to follow your comment. I did a Cartesian search rather than a random search and got good results. Thanks Erin! – Hari Prasad Apr 05 '17 at 14:41