
I have a dataset which contains all the quotes made by a company over the past 3 years. I want to create a predictive model using the library caret in R to predict whether a quote will be accepted or rejected.

The structure of the dataset is causing me some problems. It contains 45 variables; however, I have included only two below, as they are the only variables relevant to this problem. An extract of the dataset is shown below.

contract.number     item.id  
0030586792          32X10AVC
0030586792          ZFBBDINING
0030587065          ZSTAIRCL
0030587065          EMS164
0030591125          YCLEANOFF
0030591125          ZSTEPSWC



contract.number <- c("0030586792","0030586792","0030587065","0030587065","0030591125","0030591125")
item.id <- c("32X10AVC","ZFBBDINING","ZSTAIRCL","EMS164","YCLEANOFF","ZSTEPSWC")
dataframe <- data.frame(contract.number,item.id)

Each unique contract.number corresponds to a single quote made. The item.id corresponds to the item that is being quoted for. Therefore, quote 0030586792 includes both items 32X10AVC and ZFBBDINING.

If I randomise the order of the dataset and model it in its current form, I am worried that a model would just learn which contract.numbers won and lost during training. This would invalidate my testing, since in the real world the outcome is not known before the prediction is made. I also have the additional issue of what to do if the model predicts that the same contract.number will win with some item.ids and lose with others.
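One way to address the leakage concern is to split by contract rather than by row, so that every line of a quote lands on the same side of the split. The following is a minimal base-R sketch, assuming the example `dataframe` from the question. The 70% training fraction, the seed, and the names `train_ids`, `train_set` and `test_set` are illustrative assumptions, not part of the original data.

```r
# Rebuild the example data so the snippet is self-contained
contract.number <- c("0030586792", "0030586792", "0030587065",
                     "0030587065", "0030591125", "0030591125")
item.id <- c("32X10AVC", "ZFBBDINING", "ZSTAIRCL",
             "EMS164", "YCLEANOFF", "ZSTEPSWC")
dataframe <- data.frame(contract.number, item.id)

set.seed(42)  # arbitrary seed, only for reproducibility

# Sample whole contracts, not individual rows
contracts <- unique(as.character(dataframe$contract.number))
n_train   <- max(1, floor(0.7 * length(contracts)))  # 0.7 is an assumed fraction
train_ids <- sample(contracts, size = n_train)

# Every row of a contract follows its contract into train or test
in_train  <- as.character(dataframe$contract.number) %in% train_ids
train_set <- dataframe[in_train, ]
test_set  <- dataframe[!in_train, ]

# Check: no contract appears on both sides of the split
shared <- intersect(as.character(train_set$contract.number),
                    as.character(test_set$contract.number))
length(shared) == 0
```

caret also provides a `groupKFold()` helper for building grouped cross-validation folds, whose output can be passed to `trainControl(index = ...)`; whether that suits this dataset better than a single hold-out split is a judgment call.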

My ideal solution would be to condense each contract.number into a single line with multiple item.ids per line, forming a three-dimensional dataframe, but I am not aware whether caret would then be able to model this. It is not realistic to split the item.ids into multiple columns, as some quotes have hundreds of item.ids. Any help would be much appreciated! (Sorry if I haven't explained this well!)
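A quote can be collapsed to one row without a three-dimensional structure by building a contract-by-item indicator table, which is still an ordinary two-dimensional data frame. Below is a minimal base-R sketch, assuming the example `dataframe` from the question; `item_counts` and `wide` are hypothetical names, and the accepted/rejected outcome column is left out because it does not appear in the extract.

```r
# Rebuild the example data so the snippet is self-contained
contract.number <- c("0030586792", "0030586792", "0030587065",
                     "0030587065", "0030591125", "0030591125")
item.id <- c("32X10AVC", "ZFBBDINING", "ZSTAIRCL",
             "EMS164", "YCLEANOFF", "ZSTEPSWC")
dataframe <- data.frame(contract.number, item.id)

# Cross-tabulate: rows are contracts, columns are items, cells are counts
item_counts <- table(dataframe$contract.number, dataframe$item.id)

# Turn the table into a regular data frame with one row per contract
wide <- as.data.frame.matrix(item_counts)

# Reduce counts to 0/1 indicators: 1 if the item is on the quote
wide[wide > 0] <- 1

# The contract number lives in the row names as an identifier only,
# so it is not offered to the model as a predictor
dim(wide)
rownames(wide)
```

With one row per quote, a model makes a single prediction per contract, so the case of one contract winning on some items and losing on others does not arise. The table does grow one column per distinct item; with hundreds of items a sparse representation may be worth considering, though that is outside this sketch.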

jmuhlenkamp
  • What type of model are you even running with this data? You seem to basically have two categorical variables here that you can't really train on. There doesn't really seem to be a clear programming question here. If you need help choosing the right statistical model for your data, you would get better help from the statisticians at [stats.se]. First know exactly what type of test/model you want to run (or should run). Then if you don't know how to do that in R, you can ask a more specific question here. – MrFlick Jul 04 '18 at 22:57
  • @MrFlick My plan is to use a range of different models to test which yields the best results. Initially, I will use KNN, random forests and NBC. Thanks for your help; I may seek advice from there. – Ross Headington Jul 04 '18 at 23:01
  • @RossHeadington You will need to provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that we can help you better. As this is more of a data modelling/implementation question, it would also be suited to [Data Science Stack Exchange](https://datascience.stackexchange.com). If any statistical help is required, then post at Cross Validated. – Mankind_008 Jul 05 '18 at 00:03
  • Items can be one-hot encoded; there are plenty of posts on SO on how to do that. Caret can do this as well. Do not use the contract.number in any of your models. The model will learn on this feature, which is completely useless: a contract number has no predictive value whatsoever, as it is just an incremental value from the transaction system. You should check whether the other variables have some predictive value. – phiver Jul 05 '18 at 06:58

0 Answers