I am creating a tabular classification model with CreateML and Swift. The dataset has about 300 rows and about 13 features. I have trained and tested the model in two ways and gotten surprisingly different outcomes:
1) Splitting my training and evaluation data table randomly from the original full data set:
let (classifierEvaluationTable, classifierTrainingTable) = classifierTable.randomSplit(by: 0.1, seed: 4)
I have played around a bit with the 0.1 split proportion and the seed, but the results are all over the place: evaluation accuracy can come out anywhere from 33% to 80% depending on the split. (In this particular case I got 78% training accuracy, 83% validation accuracy, and 75% evaluation accuracy.)
2) I manually took 10 items from the original data set and put them into a separate test set, then removed them from the ~300-item set used for training. When I tested these 10 items, I got 96% evaluation accuracy. (In this case I got 98% training accuracy, 71% validation accuracy, and 96% evaluation accuracy.) A simplified sketch of both setups is just below this list.
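For context, here is a simplified, playground-style sketch of both setups. The file paths and the "label" target column are placeholders for my real ones, and the accuracy percentages above are just (1 - classificationError) from the CreateML metrics objects:

import CreateML
import Foundation

// Approach 1: random split of the full ~300-row table.
let classifierTable = try MLDataTable(contentsOf: URL(fileURLWithPath: "full_data.csv"))
let (classifierEvaluationTable, classifierTrainingTable) = classifierTable.randomSplit(by: 0.1, seed: 4)

let randomSplitClassifier = try MLClassifier(trainingData: classifierTrainingTable, targetColumn: "label")

// Training and validation accuracy are reported by the trained model itself;
// evaluation accuracy comes from scoring the held-out 10% table.
let trainingAccuracy = (1.0 - randomSplitClassifier.trainingMetrics.classificationError) * 100
let validationAccuracy = (1.0 - randomSplitClassifier.validationMetrics.classificationError) * 100
let evaluationMetrics = randomSplitClassifier.evaluation(on: classifierEvaluationTable)
let evaluationAccuracy = (1.0 - evaluationMetrics.classificationError) * 100

// Approach 2: train on the ~290 remaining rows and evaluate on the 10 rows
// I removed by hand into a separate file.
let manualTrainingTable = try MLDataTable(contentsOf: URL(fileURLWithPath: "training_290.csv"))
let manualTestTable = try MLDataTable(contentsOf: URL(fileURLWithPath: "holdout_10.csv"))

let manualSplitClassifier = try MLClassifier(trainingData: manualTrainingTable, targetColumn: "label")
let manualEvaluationMetrics = manualSplitClassifier.evaluation(on: manualTestTable)
let manualEvaluationAccuracy = (1.0 - manualEvaluationMetrics.classificationError) * 100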
I am wondering why there is such a big difference. Which reading should be seen as more realistic and credible? Is there anything I can do in either setup to improve accuracy and credibility? ALSO: I am confused about what the different accuracy measurements (training, validation, evaluation) mean and how I should interpret them.
Thanks.