I have a dataset which contains all the quotes made by a company over the past 3 years. I want to create a predictive model using the library caret in R to predict whether a quote will be accepted or rejected.
The structure of the dataset is causing me some problems. It contains 45 variables, however, I have only included two bellow as they are the only variables that are important to this problem. An extract of the dataset is shown below.
contract.number item.id
0030586792 32X10AVC
0030586792 ZFBBDINING
0030587065 ZSTAIRCL
0030587065 EMS164
0030591125 YCLEANOFF
0030591125 ZSTEPSWC
contract.number <- c("0030586792","0030586792","0030587065","0030587065","0030591125","0030591125")
item.id <- c("32X10AVC","ZFBBDINING","ZSTAIRCL","EMS164","YCLEANOFF","ZSTEPSWC")
dataframe <- data.frame(contract.number,item.id)
Each unique contract.number corresponds to a single quote made. The item.id corresponds to the item that is being quoted for. Therefore, quote 0030586792 includes both items 32X10AVC and ZFBBDINING.
If I randomise the order of the dataset and model it in its current form I am worried that a model would just learn which contract.numbers won and lost during training and this would invalidate my testing as in the real world this is not known prior to the prediction being made. I also have the additional issue of what to do if the model predicts that the same contract.number will win with some item.id's and loose with others.
My ideal solution would be to condense each contract.number into a single line with multiple item.ids per line to form a 3 dimensional dataframe. But i am not aware if caret would then be able to model this? It is not realistic to split the item.ids into multiple columns as some quotes have 100s of item.id's. Any help would be much appreciated! (Sorry if I haven't explained well!)